Windows and Linux CPU, Cache and RAM PC Benchmarks

Roy Longbottom

Contents


Summary

MemSpeed Benchmark
Windows MemSpeed Results
MemSpeed L1 Cache Speeds
MemSpeed L2/L3 Cache Speeds
MemSpeed RAM Speeds
BusSpeed Benchmark
Windows BusSpeed Results
BusSpeed L1 Cache Speeds
BusSpeed L2/L3 Cache Speeds
BusSpeed RAM Speeds
RandMem Benchmark
Windows RandMem Results
RandMem L1 Cache Speeds
RandMem L2/L3 Cache Speeds
RandMem RAM Speeds
SSEfpu Benchmark
Windows SSEfpu Results
SSEfpu L1 Cache Speeds
SSEfpu L2/L3 Cache Speeds
SSEfpu RAM Speeds
FFT Benchmarks
More FFT and Graph
FFTGraf Version 1
FFTGraf Version 2
FFTGraf Version 3
Linux Benchmarks
Linux MemSpeed Benchmark
Linux BusSpeed Benchmark
Linux RandMem Benchmark
Linux SSEfpu Benchmark
Linux FFT Benchmarks


Summary

These benchmarks provide performance measurements over a wide range of data sizes, covering all caches and RAM, using different processing scenarios. In many cases, the programs have been compiled for both 32 bit and 64 bit systems. They emphasise the danger of comparing computer system performance by using a single number. A single overall rating was included in the old Whetstone Benchmark, where comparing the 9 separate tests on a 2013 3900 MHz Core i7 with those on a 1992 66 MHz 80486 showed an average performance improvement of 239 times, with a range of 160 to 336 times (MHz ratio 59x). For these memory tests, average and maximum improvements can be more than 1000 and 3000 times, the additional contributory factors being increased cache sizes and operating speed.

The benchmarks are as follows. In each case performance is measured in MBytes per second:

MemSpeed - Carries out three different sets of single and double precision floating point and integer calculations via two data arrays, the Windows version using assembly code instructions. The Linux version uses compiled C code, with a variation in some calculations, enabling 32 bit and 64 bit varieties to be provided. The normal floating point operations for the latter are SSE type SIMD instructions, with up to four simultaneous calculations. This produced a respectable 6 single precision GFLOPS on a 3.9 GHz CPU, and more than 14 GFLOPS when compiled using the AVX1 directive.

BusSpeed - The benchmark is intended to demonstrate maximum data transfer speeds from buses, caches and RAM, using 32 or 64 bit integer words and data loaded into 64 bit MMX or 128 bit SSE registers. On the latest PCs, use of multiple cores appears to be required to achieve this goal. Reading starts with one word, with a large address increment to the next one, the increment being halved for following measurements, until all data is read. This identifies where data is read in bursts and provides a means of estimating bus and maximum RAM (or cache) speed. Reading all data is shown to take place at up to nearly 4 MIPS/MHz on the fastest PC tested, where multiple programs also indicated that RAM was working at 85% of specified maximum speed.

RandMem - Serial and random address selections are employed by this benchmark, using the same complex integer based indexing, with read and read/write tests for 32 bit integers and 64 bit floating point numbers. The main purpose is to show the difference between serial and random data transfer speed, where the latter is considerably reduced by burst reading or writing, in turn affected by data size. The full example shows serial reading at up to 28 times faster than random access.

SSEfpu - This carries out floating point calculations, similar to MemSpeed, to compare data transfer speeds, and associated MFLOPS, between two at a time SSE2 double precision, four at a time SSE single precision and single word calculations. GFLOPS obtained by that 3.9 GHz CPU were up to 5.1, 10.2 and 4.9 respectively. A later version for Linux included code that leads to linked multiply and add operations, producing up to eight floating point operations per clock cycle, or 31.2 GFLOPS on the 3.9 GHz CPU, with the benchmark demonstrating 25 GFLOPS.

FFT Benchmarks - Three versions were produced, the first being the original C code, the second with further optimised assembly language and the third using SSE SIMD instructions. The benchmarks run code for single and double precision Fast Fourier Transforms of size 1024 to 1048576 (1K to 1024K), each one being run a number of times to identify variance, with results in milliseconds. The latest replaces the last two with an extensively modified C program. Memory used varies between 16 KB and 52 MB. The programs use skipped sequential memory access, making them susceptible to burst data transfer degradation. Reiterating the earlier Core i7 performance advantage over the 80486, the second version provided gains of between 939 and 1321 times.

Note - This document was converted by Winnovative Free HTML to PDF Converter to include in my ResearchGate material.



MemSpeed Benchmark

MemSpd2K is a full Windows benchmark that employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers, via two data arrays:

    Sum to register   r = r + x[m] * y[m]  (Integer: r = r + x[m] + y[m])
    Sum to memory     x[m] = x[m] + y[m]                    
    Memory to memory  x[m] = y[m]                           

These are executed from assembly code which uses the same instructions as the original command line driven MemSpeed benchmark. The memory loading speed is calculated in terms of millions of bytes per second (MB/S). Measurements are made at 4000, 8000, 16000 etc. memory bytes, up to 25% of the main RAM size, to produce speed ratings via data from different levels of cache and from RAM. A pre-compiled version of the benchmark can be found in MemSpd2K.zip, which also contains the source code, providing further explanatory comments. MemSpeed can be found in DOSTests.zip - file MDTRDOS.exe. The benchmark has also been run on other platforms. Results are available for Android, Raspberry Pi and PC Linux.
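As a rough illustration of the three operation sequences, the following is a minimal C sketch, not the benchmark's own assembly code. The function names and the `mbytes_per_second` helper are assumptions made for this example, and only the double precision variants are shown.

```c
#include <stddef.h>

/* Sum to register: r = r + x[m] * y[m] */
double sum_to_register(const double *x, const double *y, size_t n)
{
    double r = 0.0;
    for (size_t m = 0; m < n; m++)
        r += x[m] * y[m];
    return r;
}

/* Sum to memory: x[m] = x[m] + y[m] */
void sum_to_memory(double *x, const double *y, size_t n)
{
    for (size_t m = 0; m < n; m++)
        x[m] += y[m];
}

/* Memory to memory: x[m] = y[m] */
void memory_to_memory(double *x, const double *y, size_t n)
{
    for (size_t m = 0; m < n; m++)
        x[m] = y[m];
}

/* Speed in millions of bytes per second (M = 1,000,000), given the
   bytes handled in one pass, the number of passes and elapsed seconds.
   Which bytes are counted per pass is left to the caller (assumption). */
double mbytes_per_second(double bytes_per_pass, double passes, double seconds)
{
    return bytes_per_pass * passes / seconds / 1.0e6;
}
```

A caller would time repeated passes of one function over arrays of a chosen size, then pass the totals to `mbytes_per_second`. An optimising compiler may vectorise or remove these loops, so measured speeds need not match the assembly version.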

The following is an example results log file. Conversion factors for MFLOPS and Integer MIPS are shown at the bottom. For floating point, double precision and single precision arithmetic speeds tend to be the same, unless limited by memory speed, and this is not the general case here. The complex instruction set used in the assembly code includes instructions that add to registers directly from memory, rather than using separate load and add instructions. This reduces the instruction count, providing more MIPS per MegaByte of data transferred.

    Core i7 4820K mainly running at 3.9 GHz using Turbo Boost
             1600 MHz RAM over 4 channels, Windows 8.1

    Memory  s=s+x[m]*y[m] Int+     x[m]=x[m]+y[m]         x[m]=y[m]
    KBytes  Dble   Sngl   Int    Dble   Sngl   Int    Dble   Sngl   Int
     Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 L1     4  20385  10312  15486  27497  13882  20125  24835  12456  12347
        8  20757  10406  15540  27765  13895  20401  24914  12487  12478
       16  20805  10419  15622  27649  13900  20376  25014  12502  12507
       32  20837  10409  15620  27787  13823  20378  24112  12498  12478
 L2    64  20846  10427  15623  27493  13888  20145  22644  12260  12243
      128  20852  10379  15558  27515  13897  20187  23479  12405  12211
      256  20829  10380  15627  27003  13866  19975  21146  12058  11995
 L3   512  20826  10424  15593  24664  13885  18948  14964  10811  10615
     1024  20750  10423  15630  24762  13824  18879  15122  10721  10625
     2048  20754  10423  15633  24771  13887  18911  14962  10781  10568
     4096  20481  10330  15317  23879  13497  18610  14938  10672  10532
     8192  19555  10025  14849  22361  13060  17826  13360  10250  10170
RAM 16384  16340   9264  13242  14314  11028  13304   7333   7599   7581
    32768  15927   9333  12964  13882  10785  12863   6995   7365   7343
    65536  16036   9226  13031  13939  10932  12893   7006   7388   7380
   131072  16251   9371  13134  13946  10989  13076   7089   7405   7375
   262144  16287   9391  13067  13971  10984  12991   7100   7368   7363
   524288  15998   9335  12869  13904  10959  12956   7060   7294   7353
  1048576  16357   9386  13032  13957  10981  12999   7081   7376   7391

Max        20852  10427  15633  27787  13900  20401  25014  12502  12507
FP Divide by   8      4            16      8                               
MFLOPS      2607   2607          1737   1738                            
Int Divide by             2.91                 2.29                 1.45
MIPS                      5372                 8909                 8626

Maximum RAM speed 800 MHz x 2 DDR x 8 bus width x 4 channels = 51.2 GB/sec
Multiple cores need to be used for a higher throughput from RAM 
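The conversion lines at the bottom of the log can be checked with a small calculation. This is a sketch only: the function names are hypothetical, and rounding to the nearest whole number is an assumption that happens to reproduce the printed MFLOPS and MIPS values.

```c
/* Convert a measured MB/s figure to MFLOPS or MIPS using the
   "Divide by" factor printed in the results log. */
double rate_from_mbytes(double mbytes_per_sec, double divide_by)
{
    return mbytes_per_sec / divide_by;
}

/* Round a positive rate to the nearest whole number
   (the rounding rule is an assumption, not taken from the source). */
long round_rate(double rate)
{
    return (long)(rate + 0.5);
}
```

For example, the maximum double precision sum to register speed of 20852 MB/S divided by 8 gives 2607 MFLOPS, and the integer memory to memory speed of 12507 MB/S divided by 1.45 gives 8626 MIPS, as in the log.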
    



Windows MemSpeed Results

Results below are a selection from those in memspd2k results.htm. These include MemSpeed results for early PCs which, as demonstrated by those for the 100 MHz Pentium, are normally very similar to those from the later benchmark. The exception was the Pentium 4 (see Slow below), where reading data from caches can be slower than reading from main memory. In this case, two arrays are allocated with addresses in multiples of 2048 bytes apart and this appears to identify a design limitation with the Intel P4 CPU (and the version of Windows?), where inappropriate cache flushing is applied. This problem appears to have been rectified on later P4 CPUs (see P4E), but SSE3DNow Benchmark results are the preferred option, as it uses the same calculations to measure performance.

Below are separate speeds for data in L1 cache, L2 cache, L3 cache and RAM, for PCs over 22 years from 1991.

MemSpd2K L1 Cache Results in MBytes/Second

                         s=s+x[m]*y[m]        x[m]=x[m]+y[m]         x[m]=y[m]
CPU               MHz   Dble   Sngl    Int   Dble   Sngl    Int   Dble   Sngl    Int

AMD 80386 Not2K    40      7      4     16      5      3     11      4      2      9
80486 Not2K        66     37     20     71     34     18     64     29     16     30
Pentium Not2K     100    267    148    170    313    162    197    105     53     53
Pentium           100    220    122    145    296    149    156    113     53     52
Pentium Pro       200    892    482    559    896    487    697    782    394    350
Pentium MMX       200    577    303    374    667    335    424    355    153    172
Celeron A         300   1340    725    861   1348    731   1031   1172    590    526
Celeron 2         600   2704   1455   1734   2714   1467   2074   2366   1186   1058
Pentium II        450   2025   1049   1298   2039   1099   1184   1760    892    794
Pentium III       450   1954   1066   1258   1969   1073   1536   1720    862    768
Pentium IIIE      600   2688   1457   1550   2701   1463   2070   2354   1185   1053
Pentium IIIEB     800   3598   1950   2311   3610   1959   2763   3150   1587   1410
PIII Tualatin    1266   6102   2865   4454   6164   2875   4410   5675   2863   2502
Celeron M        1295   6524   3300   4495   8943   4425   5087   6711   3329   3395
Pentium M        1862   9691   4671   6505  12896   6495   6893   9230   4397   4884
Pentium 4 Not2K  1900   5689   2852   6320   9433   4000   5125   4769   2627   3466
Pentium 4 Slow   1900   1740    344    159   2657   1523   1292   4138   1803   1547
Pentium 4N Slow  2533   2350    451    222   3919   2060   1736   5353   2214   1767
Pentium 4N       2533   6490   2989   1761   6716   2527   2360   5286   2075   1728
Pentium 4E       3000   7830   3885  11355  13096   5704   6472   8560   5469   5737
Atom M           1600   3337   1776   4577   1869    941   4188   1322    669   2094
Core 2 Duo M     1830   9210   4687   6396  12591   5301   7036  11307   3597   5561
Celeron C2 M     2000  10405   5002   7373  13858   6202   7322  12357   4005   6198
Core 2 Duo 1 CP  2400  12556   6122   8921  16749   7683   9510  14924   4667   7536
Core i5 2467M    2467  11480   5804   8847  15882   7783  11275  13364   6873   7021
Core i7 4820K    3900  20757  10406  15540  27765  13895  20401  24914  12487  12478
Cyrix M300        225    242    143    296    260    130    284    215    108    148
AMD K62           450    977    524   1573    790    395   1419    588    323   1156
AMD K63           400    864    469   1399    700    354   1267    451    233    340
Duron             700   2733   1379   2756   4513   2441   2958   3193   1573   1993
Athlon            550   2145   1084   1873   3637   1902   2310   2494   1234   1549
Athlon Tbird     1000   3913   1980   3952   6767   3507   4243   4575   2244   2841
Athlon 4         1533   5916   3036   6065  10590   5382   6937   7302   3733   4327
Ath4 Barton      1800   6779   3538   7054  12488   6252   8087   8506   4345   5111
Turion 64 M      1900   7325   3736   7513  14610   6430   8063   9224   4598   4879
Athlon XP        2080   8148   4114   8229  14479   7313   9412   9892   5057   5950
Opteron          2000   7869   3941   7892  15565   7005   8430   9884   4969   5911
Athlon 64        2210   8656   4379   8601  17241   7771   9207  10801   5525   6636
Phenom II        3000  11821   5930  11931  21669  11792  13040  14829   7317   8731



MemSpd2K L2 and L3 Cache Results in MBytes/Second

                         s=s+x[m]*y[m]        x[m]=x[m]+y[m]         x[m]=y[m]
CPU               MHz   Dble   Sngl    Int   Dble   Sngl    Int   Dble   Sngl    Int

80486 Not2K        66     25     15     29     20     13     27     14     10     18
Pentium Not2K     100    121     89    100     93     74     82     60     37     43
Pentium           100    105     76     87    111     85     84     94     46     46
Pentium Pro       200    667    436    553    377    346    325    286    240    229
Pentium MMX       200    235    170    202    158    143    158    101     68     73
Celeron A         300    909    620    756    747    560    649    402    362    324
Celeron 2         600   2784   1359   1727   2143   1067   1388   1312    929    942
Pentium II        450   1188    656    715    525    393    521    275    241    220
Pentium III       450   1229    657    733    532    434    561    292    251    285
Pentium IIIE      600   2406   1315   1645   2127   1184   1380   1154    919    887
Pentium IIIEB     800   3710   1821   2317   2870   1449   1857   1747   1300   1258
Pentium IIIEB    1000   4626   2267   2888   3568   1815   2309   2170   1623   1532
PIII Tualatin    1266   5743   2935   3505   5073   2452   2939   2869   2034   1966
Celeron M        1295   6462   3333   3543   4760   3432   3234   3427   2556   2450
Pentium M        1862   9278   4792   5127   6777   4935   4694   4272   3541   3702
Pentium 4 Not2K  1900   5896   2865   3712   7529   3523   4650   3893   2151   2942
Pentium 4 Slow   1900   1719   1022     90   2389   1261   1170   3153   1554   1267
Pentium 4N Slow  2533   2034   1669    125   3537   1577   1536   5461   1805   1838
Pentium 4N       2533   6381   2935   1764   5900   2365   2326   5345   2067   1643
Pentium 4E       3000   7644   3856   4334   8084   4734   6581   6336   4062   4527
Atom M           1600   2651   1585   3301   1805    914   2972   1338    669   1437
Core 2 Duo M     1830   9357   4725   6168   8651   5609   5872   5943   3760   3807
Celeron C2 M     2000  10581   5289   6996   9569   6291   6564   6529   3905   3799
Core 2 Duo 1 CP  2400  12755   6380   8463  11561   7578   7928   7798   5328   5349
Core i5 2467M    2467  11709   5977   8932  15714   7518  11419  13219   6796   7062
Core i7 4820K    3900  20852  10379  15558  27515  13897  20187  23479  12405  12211
Cyrix M300        225    175    115    208    172    104    173    110     90     98
AMD K62           450    434    313    465    292    216    307    175    172    172
AMD K63           400    674    364    747    539    305    702    424    227    317
Duron             700   1477   1073   1007   1373    806    901    947    637    570
Athlon            550    772    640    639    693    469    559    447    378    345
Athlon Tbird     1000   2636   1792   1661   2089   1237   1373   1484    974    876
Athlon 4         1533   3565   2685   2609   3119   1866   2099   2102   1539   1349
Ath4 Barton      1800   3985   3068   2958   3563   2202   2532   2439   1849   1569
Turion 64 M      1900   4603   3554   3595   3601   2088   3139   2625   1807   1714
Athlon XP        2080   4663   3567   3444   4148   2614   2940   2840   2151   1823
Dual Opteron     2000   5102   3940   4089   3930   2252   2402   3305   2244   2197
Athlon 64        2210   4322   4388   4661   4883   2734   3921   3789   2507   2487
Phenom II        3000  11839   6017  11581  14976  10128  10365   8189   6680   6368

L3 Cache

Phenom II        3000   8530   5808   7261   8091   6890   7350   4355   3787   3807
Core i5 2467M    2300  11759   5910   8748  14039   7464  10547   9051   6226   6195
Core i7 1 CP     3060  15391   7801  10808  11034   5524  10451   9204   5814   6495
Core i7 3820     &&&&  19730   9980  14787  22746  13090  17301  14401  10074   9885
Core i7 4820K    3900  20481  10330  15317  23879  13497  18610  14938  10672  10532



MemSpd2K RAM Speed Results in MBytes/Second

                                s=s+x[m]*y[m]        x[m]=x[m]+y[m]         x[m]=y[m]
CPU             RAM      MHz   Dble   Sngl    Int   Dble   Sngl    Int   Dble   Sngl    Int

AMD 80386 Not2K           40      6      4     11      4      3      8      4      2      7
80486 Not2K               66     16     12     18     11      9     12      8      7      8
Pentium Not2K            100     59     50     54     42     39     41     30     21     22
Pentium                  100     60     49     54     50     44     45     41     26     27
Pentium Pro     P0       200    138    134    138    100     85     89     49     51     49
Pentium MMX     P0       200    130    107    118     99     81     84     74     56     59
Celeron A       P0       300    347    195    230    189    133    142     96     95     95
Celeron 2       P0       600    418    239    309    255    163    166    137    123    127
Pentium II      P1       450    492    253    305    270    171    187    142    142    142
Pentium III     P1       450    503    235    335    300    199    198    161    160    163
Pentium IIIE    P1       600    404    305    308    241    152    161    153    124    127
Pentium IIIEB   P2       800    771    434    551    313    224    222    157    152    152
PIII Tualatin   P2      1266    663    630    630    370    368    364    185    188    186
Celeron M               1295   1431   1340   1349    868    814    809    447    446    437
Pentium M       DC1     1862   2473   2682   2770   1369   1373   1370    711    697    693
Pentium 4 Not2K P2      1900    843    839    832    544    552    551    273    277    277
Pentium 4       P2      1900    822    662    301    578    511    567    295    298    297
Pentium 4N      R2      2533   3052   2407   1650   1643   1582   1566    872    855    847
Pentium 4N      DC1     2533   2304   2001   1632   1339   1280   1271    683    673    666
Pentium 4E      DC2     3000   3430   3052   3115   2323   2310   2219   1141   1123   1135
Atom M          DCC     1600   2334   1476   2827   1568    884   1693    900    653    901
Core 2 Duo M    DC4     1830   3939   3794   3924   2476   2373   2398   1257   1206   1171
Celeron C2 M    DC3     2000   3129   3010   2786   1898   1905   1872    946    953    959
Core 2 Duo *    DC3     2400   4816   4761   4828   3068   3085   3081   1568   1539   1547
Core i5 2467M   DC7     2467   9970   5532   7847   9589   6939   8323   5111   4682   4640
Core i7 4820K   DC8 4c  3900  16287   9391  13067  13971  10984  12991   7100   7368   7363
Cyrix M300               225    100     78    112     83     65     84     45     42     43
AMD K62         P1       450    219    191    217    142    123    138     74     73     72
AMD K63         P1       400    165    157    171    108     98    110     84     50     50
Duron           P2       700    413    264    265    270    217    229    252    171    163
Athlon          P1       550    296    260    259    247    211    223    186    140    140
Athlon Tbird    P2      1000    313    293    292    303    235    281    238    177    170
Athlon 4        D1      1533    790    764    756    766    674    719    422    393    367
Ath4 Barton #   D1      1800    569    562    557    397    397    398    200    198    187
Turion 64 M     DC3     1900   2517   2453   2566   2078   1808   1781   1147   1101   1008
Athlon XP       D2      2080   1221   1218   1169    960    896    927    523    487    456
Opteron         D3      2000   2338   2347   2360   2182   1818   1783   1171   1089   1097
Athlon 64       DC2     2210   3023   2962   2942   2004   1952   1975   1076   1047    980
Phenom II       DC7     3000   4993   4146   4240   4285   3890   4215   2323   2100   2087

Key
P1   100 MHz
P2   133 MHz
D1   DDR 133 MHz
D2   DDR 166 MHz
D3   DDR 200 MHz
DC1  Dual Channel DDR 133 MHz
DC2  Dual Channel DDR 200 MHz
DC3  DDR2 533 MHz
DC4  DDR2 666 MHz
DC5  DDR2 800 MHz
DC6  DDR3 1066 MHz
DCC  DDR2 533 MHz 1 channel
DC7  DDR3 1333 MHz
DC8  DDR3 1600 MHz 1 2 4 channels
R1   RDRAM 400 MHz
R2   RDRAM 533 MHz
#    Slow speed examples (Ath4 slow chipset, Core 2 Duo slow nForce 570 chipset)
*    Core 2 Duo Intel 965 chipset
M    Mobile




BusSpeed Benchmark

BusSpd2K benchmark is intended to demonstrate maximum data transfer rates from caches and RAM using 32 bit integer words and 64 bit MMX words. MOV and AND assembly code instructions are used, with 64 instructions in the inner loops for integers and 512 instructions for MMX. The program measures speeds with data size 4, 8, 16, 32 etc. KBytes up to a maximum of 50% RAM size. Results are given in MBytes/second (MB/s), where M = 1,000,000. An approximation of processor execution speed in Millions of Instructions Per Second (MIPS) can be obtained by dividing MB/s for integer tests by 4 and those for MMX tests by 8.

Ten different tests load data to one CPU register or 2 registers alternately (MMX 1 or 8). Tests 5 and 6 use MOV instructions to 1 and 2 integer registers, with tests 7 and 8 the same except using AND. These identify differences between CPU models. Tests 9 and 10 use MMX MOV to 1 and 8 registers, normally demonstrating maximum data transfer speeds. Tests 1 to 4 load a 32 bit word (4 bytes) with address increments of 64, 32, 16, 8 bytes respectively. These are intended to demonstrate bus operation and speed where data is transferred in bursts.
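The address increment tests can be pictured with a short C sketch. This is an illustration under stated assumptions, not the benchmark's assembly code: the function names are hypothetical, and an AND accumulation stands in for the MOV loads so that a compiler cannot discard the reads.

```c
#include <stddef.h>
#include <stdint.h>

/* Load one 32 bit word every inc_bytes bytes across the data area and
   AND the words together. The returned value only exists to keep the
   loads live in compiled code (an assumption of this sketch). */
uint32_t strided_and(const uint32_t *data, size_t bytes, size_t inc_bytes)
{
    size_t step = inc_bytes / sizeof(uint32_t);
    size_t words = bytes / sizeof(uint32_t);
    uint32_t acc = 0xFFFFFFFFu;

    if (step == 0)
        step = 1;               /* never step by less than one word */
    for (size_t i = 0; i < words; i += step)
        acc &= data[i];
    return acc;
}

/* Bytes actually loaded in one pass: 4 bytes per increment, which is
   the quantity an MB/s figure for these tests would be based on. */
size_t bytes_loaded(size_t bytes, size_t inc_bytes)
{
    return (bytes + inc_bytes - 1) / inc_bytes * sizeof(uint32_t);
}
```

With a 64 byte increment only one word in sixteen is loaded, so where whole bursts are still transferred over the bus, reported MB/s falls roughly in proportion to the increment.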

A pre-compiled version of the benchmark can be found in busspd2k.zip, which also contains the source code, providing further explanatory comments. The benchmark has also been run on other platforms. Results for Android and Raspberry Pi are already available at ResearchGate. Later, the intention is to upload further reports for Linux, multi-core and stress testing versions. A summary of further details can be found in busspd2k results.htm.

The results below represent the best performance that could be expected on a May 2014 desktop, assuming no overclocking. They are followed by some single thread results from the later multithreading version, which has different memory address increments and includes SSE2 functions instead of MMX. There is also a 64 bit compilation that uses 64 bit integers. Here, measured MB/second can be twice as high as with the 32 bit program, implying the same execution time using larger registers. The later version also consists entirely of compiled C code, using long sequences of AND functions.

Core i7 4820K mainly running at 3.9 GHz using Turbo Boost
32 GB 1600 MHz RAM over 4 channels, Windows 8.1

          MovI  MovI  MovI  MovI  MovI  MovI  AndI  AndI  MovM  MovM
  Memory  Reg2  Reg2  Reg2  Reg2  Reg1  Reg2  Reg1  Reg2  Reg1  Reg8
  KBytes Inc64 Inc32 Inc16  Inc8  Inc4  Inc4  Inc4  Inc4  Inc8  Inc8
   Used   MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S

L1    4  14612 26366 28608 29964 30626 30592 15551 29027 60254 60322
      8  15197 28712 29947 30605 30976 30982 15600 29350 61449 61471
     16  15295 29857 30563 30945 31100 31108 15611 29461 61980 61989
     32  14584 24315 28156 29773 30122 30111 15602 29073 55568 55455
L2   64   7032 12156 17523 22653 26520 26556 15624 26426 29089 29155
    128   7638 12224 17650 22580 26579 26564 15635 26468 29210 29137
    256   6983 11598 17081 22241 26317 26326 15614 26259 28745 28769
L3  512   2797  5461  9755 16920 24858 24859 15631 24818 26636 26638
   1024   2744  5378  9685 16783 24756 24730 15622 24729 26514 26529
   2048   2747  5371  9676 16744 24722 24705 15624 24621 26457 26471
   4096   2739  5346  9636 16370 24336 24341 15633 24289 25899 25864
   8192   2365  4557  8479 14462 21502 21522 15083 21488 23089 23097
R 16384    969  2113  4167  8377 13906 13890 13099 13922 14502 14519
  32768    928  2003  3887  8154 13591 13587 13046 13593 14045 14045
  65536    931  2011  3905  8206 13639 13624 13076 13632 14075 14095
 131072    944  2055  3914  8276 13672 13670 13138 13701 14141 14146
 262144    945  2055  3920  8305 13709 13686 13110 13701 14136 14151
 524288    933  2024  3918  8225 13666 13657 13101 13648 14107 14117
1048576    945  2059  3919  8276 13681 13670 13124 13696 14132 14137

R = RAM                                                             

Maximum speed 800 MHz x 2 DDR x 8 bus width x 4 channels = 51.2 GB/sec
Multiple cores need to be used for a higher throughput from RAM 

          Later Multithreading Version, Single Thread Results

           Inc     Inc     Inc     Inc     Inc    Read    128b
         32wds   16wds    8wds    4wds    2wds     All    SSE2
 32 Bit                                                       
 L1      15642   15642   22493   21590   21709   21375   61610
 L2       2782    2904    5623    9806   17348   20363   40673
 RAM       644     934    1994    3842    8098   13852   15963
 64 Bit ##                                                    
 L1      31565   31291   31178   42042   42508   41978   61606
 L2       5375    5559    5793   11083   20009   34332   40516
 RAM      1034    1272    1866    4023    7724   16029   15980

        ## 64 bit wds                                         

   Example 16 32b words = Inc64B and 8 64b words = Inc64B



Windows Bus Speed Results

When registers are loaded using varying address increments, the size of a burst of data over a bus can be recognised as the increment at which data transfer speed becomes constant, for example, 32 bytes (8 words) on the Celeron A below, and 64 bytes (16 words) on the others. Maximum possible bus burst data transfer speed can be estimated from these, as 62 x 8 MB/second for the Celeron A and 62 x 16 for the Duron. The multithreading results above suggest even larger bursts, particularly using 64 bit words.
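The burst estimate is a simple scaling: if each 4 byte word loaded costs a full burst on the bus, the bus is moving burst size / 4 times the reported speed. A minimal sketch, with a hypothetical function name:

```c
/* Estimated maximum burst transfer speed in MB/s, from the speed
   measured at an address increment equal to the burst size, where
   only one 32 bit word (4 bytes) per burst is counted. */
double max_burst_mbps(double inc_speed_mbps, unsigned burst_bytes)
{
    const unsigned word_bytes = 4;
    return inc_speed_mbps * (double)(burst_bytes / word_bytes);
}
```

For the Celeron A, 62 MB/S with a 32 byte burst gives 496 MB/S, and for the Core i7, 945 MB/S with a 64 byte burst gives 15120 MB/S, matching the Max Burst column in the table that follows.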

Theoretical maximum data transfer speeds, for more modern PCs, are calculated as 8 (bus width) x bus MHz x 2 (Double Data Rate) x number of channels, 8 x 800 x 2 x 4 = 51.2 GB/second for the Core i7, then 667 MHz and 2 channels for the Phenom (666.6 x 32) and 400 MHz and 2 channels (400 x 32) for the Core 2 Duo.
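The formula above can be written as a one line function. This is a sketch with hypothetical names; the parameters are simply the four factors quoted in the text.

```c
/* Theoretical maximum RAM transfer speed in MB/s:
   bus width in bytes x bus MHz x transfers per clock x channels. */
double peak_mbytes_per_sec(double bus_bytes, double bus_mhz,
                           double transfers_per_clock, double channels)
{
    return bus_bytes * bus_mhz * transfers_per_clock * channels;
}
```

Calling it with 8, 800, 2 and 4 gives 51200 MB/second for the Core i7, and with 8, 400, 2 and 2 gives 12800 MB/second for the Core 2 Duo.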

With the 8 byte wide bus, 8 data transfers are required for a 64 byte burst, or 4 clock pulses using DDR. Then there is confusion regarding data transfer startup time, CAS latency, that is 9 clocks for both the Core i7 and Phenom RAM. However, this can be overlapped with continuous data transfers. The latter is influenced by how fast the CPU can handle the data and it is clear that multiple cores might be required.

Examples of multi core use are below. Multithreading has its own inherent overheads, demonstrating 71% efficiency on burst reading and 61% ANDing all data on the Phenom, with Core i7 some 10% better. The 4 separate programs on the Core i7 are shown to achieve 85% of the specified maximum speed.

                                  Single CPU Core Tests
                              
                          MovI  MovI  MovI  MovI  MovI  MovI  AndI  AndI  MovM  MovM
  CPU          Max   Max  Reg2  Reg2  Reg2  Reg2  Reg1  Reg2  Reg1  Reg2  Reg1  Reg8
               Bus Burst Inc64 Inc32 Inc16  Inc8  Inc4  Inc4  Inc4  Inc4  Inc8  Inc8
              MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S

  Celeron A    800   496    62    62   122   246   406   408   425   427   494   492
  Duron       1067   992    62   123   244   496   682   681   515   515   947   946
  Pentium 4   1067  1008    63   128   259   500   969   942   954   969   997  1000
  Core 2 Duo 12800  6224   389   794  1448  2691  5020  5023  4884  4813  5617  5657
  Phenom II  21333  7392   462   901  1771  3443  5380  5355  5225  5240  6934  6936
  Core i7    51200 15120   945  2059  3919  8276 13681 13670 13124 13696 14132 14137

                                  Multiple Core Tests

  Phenom 4 Threads 15152                                           13000            
  Corei7 4 Threads 42112                                           35915            
  Corei7 4 programs                                                43547            
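The efficiency percentages quoted before the table appear to be the multiple core MB/S figures divided by the Max Bus value for the same CPU. The following sketch, with a hypothetical function name, reproduces them on that assumption.

```c
/* Measured speed as a percentage of the theoretical maximum.
   Treating Max Bus as the denominator is an assumption inferred
   from the figures, not stated in the source. */
double percent_of_peak(double measured_mbps, double peak_mbps)
{
    return 100.0 * measured_mbps / peak_mbps;
}
```

On this basis the Phenom gives about 71% (15152 / 21333) and 61% (13000 / 21333), and the four separate programs on the Core i7 give about 85% (43547 / 51200).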
   




BusSpd2K L1 Cache Results in MBytes/Second

                      MovI   MovI   MovI   MovI   MovI   MovI   AndI   AndI   MovM   MovM
                      Reg2   Reg2   Reg2   Reg2   Reg1   Reg2   Reg1   Reg2   Reg1   Reg8
CPU             MHz  Inc64  Inc32  Inc16   Inc8   Inc4   Inc4   Inc4   Inc4   Inc8   Inc8

80486 DX2        66    112    117    119    124    136    122    120    123      0      0
Pentium         100    316    355    637    679    385    713    195    380      0      0
Pentium Pro     200    679    748    769    764    775    779    758    756      0      0
Pentium MMX     200    699    735   1428   1444    776   1470    393    764   1568   1567
Celeron A       450   1617   1690   1729   1677   1724   1737   1703   1700   3517   3507
Pentium II      450   1515   1649   1745   1728   1763   1765   1710   1717   3520   3527
Pentium III     450   1624   1700   1738   1735   1759   1740   1742   1744   3513   3491
AMD K62         500   1633   1742   1685   1706   1783   1780   1725   1755   3593   3509
Celeron 2       566   2042   2136   2198   2184   2213   2210   2191   2194   4446   4445
Duron           700   2530   4807   5011   4985   4941   5034   2677   4935  10169  10151
Pentium IIIE    733   2652   2778   2852   2833   2849   2869   2844   2847   5768   5765
Athlon          800   2917   5508   5754   5735   5791   5909   3135   5777  11951  11888
Athlon Tbird   1000   3635   6738   6331   7084   7245   7391   3830   7224  14871  14808
PIII Tualatin  1266   4588   4802   4914   4862   4949   4967   4806   4804   9898   9897
Atom M         1600   5447   5628   5792   5901   5988   5973   5984   5970  12268  12480
Athlon 4       1533   5594  10379  11155  11496  11161  11173   6016  11122  22870  22845
Pentium 4      1700   6139   6343   6559   6639   6589   6540   6405   6428  13188  13276
Ath4 Barton    1800   6525  12165  12999  13367  13020  12917   7013  11689  26630  26632
Core 2 Duo M   1830   6700   6879   7061   7200   7251   7215   7214   7249  14461  14448
Pentium M      1862   6744   6875   7117   7320   7371   7381   7255   7374  14424  14658
Turion 64      1900   6872  13397  14159  14798  14190  14094   7294  14098  29407  29390
Opteron        2000   7237  14069  14802  15473  14822  15008   7783  14726  30715  30698
Celeron C2 M   2000   7362   7622   7792   7845   7877   7800   7597   7911  15124  15648
Athlon XP      2080   7585  14104  15097  15617  15011  14995   7982  15009  31009  30959
P4 Xeon        2200   7947   8301   8495   8593   8559   8561   8336   8414  17176  17184
Athlon 64      2210   8070  15711  16498  17247  16538  16763   8670  16454  34291  34254
Core i5 2467M  2300   8166  15146  17474  17504  17348  18026   8298  17087  35822  35258
Core 2 Duo     2400   8640   8820   9339   9451   9530   9530   9477   9523  18930  18909
Pentium 4E HT  3000   9686  11043  11233  11525  11562  11227  11054  11099  22804  22657
Pentium 4      3000  10915  11458  11710  11853  11784  11790  11426  11238  23589  23500
Phenom II      3000  22764  22849  23433  23768  23938  23934  12019  22553  46887  46911
Core i7 930    3066  11251  11488  11620  11614  11712  11719   5873  11718  23391  23398
Core i7 860    3466  12977  13465  13645  11701  13556  13349   6794  13742  27450  26951
Pentium 4      3678  13412  13879  14306  14358  14252  14422  13007  13473  28713  28818
Core i7 4820K  3900  15197  28712  29947  30605  30976  30982  15600  29350  61449  61471




BusSpd2K L2 and L3 Cache Results in MBytes/Second

                      MovI   MovI   MovI   MovI   MovI   MovI   AndI   AndI   MovM   MovM
                      Reg2   Reg2   Reg2   Reg2   Reg1   Reg2   Reg1   Reg2   Reg1   Reg8
CPU             MHz  Inc64  Inc32  Inc16   Inc8   Inc4   Inc4   Inc4   Inc4   Inc8   Inc8

80486 DX2        66     11     11     11     17     32     31     30     30      0      0
Pentium         100     26     26     40     75    124    139     96    117      0      0
Pentium Pro     200    133    132    234    317    488    487    454    453      0      0
Pentium MMX     200     53     53     75    131    235    235    192    232    264    264
Celeron A       450    306    305    548    793    975    975    974    976   1582   1619
Pentium II      450    179    179    359    709    829    824    831    832   1428   1433
Pentium III     450    180    180    359    531    846    846    843    846   1430   1437
AMD K62         500     29     59    117    218    436    436    429    429    436    436
Celeron 2       566    532    533   1125   1205   1392   1392   1389   1392   2410   2409
Duron           700    134    270    535   1029   1932   1955   1577   1533   2050   2008
Pentium IIIE    733    697    697   1466   1568   1805   1808   1809   1809   3135   3131
Athlon          800    106    211    424    846   1697   1698   1599   1588   1693   1698
Athlon Tbird   1000    198    360    788   1584   3144   3169   2572   2469   3104   3146
PIII Tualatin  1266   1701   1575   2513   2520   2882   2864   2879   2881   5038   5034
Atom M         1600    379    739   1385   2412   3624   3690   3683   3681   4769   4718
Pentium 4      1700   2617   3077   3544   3570   4658   4656   4598   4628   7143   7117
Ath4 Barton    1800    355    713   1421   2799   4863   4851   4009   4462   5682   5622
Core 2 Duo M   1830   1597   2523   3475   5130   6227   6234   6233   6012   7950   7976
Pentium M      1862   1214   2117   3289   4031   4731   4668   4732   4749   8077   8109
Turion 64      1900    429    831   1688   2976   5490   5383   5467   5457   5939   6086
Opteron        2000    670   1296   2588   4700   7167   7163   5870   6201   9480   9542
Celeron C2 M   2000   1791   2765   3799   5516   6812   6816   6812   6805   8747   8557
Athlon XP      2080    413    828   1645   3270   5637   5601   4597   5163   6564   6567
P4 Xeon        2200   4190   4021   4577   4630   6038   6038   6010   6020   9255   9258
Athlon 64      2210    651   1285   2411   4418   7786   7776   6448   6688   8936   8718
Core i5 2467M  2300   4196   6815   9583  12742  12941  13572   8563  14720  14156  17136
Core 2 Duo     2400   2131   3257   4597   6772   8187   8196   8168   8201  10549  10559
Pentium 4E HT  3000   2945   5640   6105   6624   7526   7536   7425   7470  13097  13303
Pentium 4      3000   5912   5521   6335   6385   8338   8337   8298   8322  12762  12779
Phenom II      3000   1500   2995   5986  11360  15036  15036  11918  15233  22377  22367
Core i7 930    3066   3213   4805   7305   9467  10811  10810   5875  10805  14442  14408
Core i7 860    3466   3595   5003   8442  11028  12618  12639   6895  12408  16719  16788
Pentium 4      3678   7258   6719   7722   7808  10161  10201  10169  10064  15423  15560
Core i7 4820K  3900   7638  12224  17650  22580  26579  26564  15635  26468  29210  29137

L3 Cache

Core i5 2467M  2300   1807   3499   5553   9167  14017  14395   9017  13494  15050  14363
Phenom II      3000    745   1485   2974   5881   9833   9825   9615   9603  11726  11650
Core i7 930    3066   2004   3497   5958   9088  10447  10448   5870  10447  13857  13857
Core i7 860    3466   2262   3537   6992  10641  12204  12233   6319  10478  15059  16251
Core i7 4820K  3900   2744   5378   9685  16783  24756  24730  15622  24729  26514  26529




BusSpd2K RAM Results in MBytes/Second

                            Max    Max   MovI   MovI   AndI   AndI    MMX
System        Key    MHz    Bus  Burst   Reg1   Reg2   Reg1   Reg2    Max

80486 DX2     B       66    133     32     25     24     23     24      0
Pentium       B      100    400     96     73     79     64     73      0
Celeron 2     #      900    800    168    166    166    166    165    166
Pentium MMX   B      200    533    232    140    140    123    139    140
Pentium Pro          200    533    256    225    225    240    240      0
AMD K6        B      550    800    272    238    238    237    238    238
Pentium IIIEB #     1000   1067    289    289    289    289    289    289
Celeron A            300    533    456    267    267    282    280    450
Celeron A            450    800    496    407    406    426    427    494
Pentium II    H      400    800    488    314    314    322    322    484
Pentium II    H      450    800    504    317    316    324    325    500
Celeron 2            600    533    504    324    326    343    343    511
Pentium III   H      450    800    528    303    304    339    334    527
Ath4 Barton   #     1800   2133    592    589    590    433    492    594
Athlon Tbird  #     1200   1067    672    528    527    351    328    670
Athlon        H      800    800    672    575    575    414    366    673
Pentium IIIE         800    800    752    463    462    477    476    764
Celeron 2            850    800    784    474    474    486    486    765
Athlon        H      900   1067    912    648    648    461    416    879
PIII Tualatin       1266   1067    912    580    579    580    575    749
Duron                700   1067    994    682    685    512    516    977
Pentium IIIEB R     1000   1600   1024    411    412    420    420    794
Pentium 4           2400   1067   1027    987    989    982    990   1010
Pentium IIIEB       1000   1067   1035    509    516    537    537    908
Athlon Tbird         800   1067   1040    677    677    516    510    942
Athlon Tbird         950   1067   1040    680    680    463    417    950
Duron               1000   1067   1043    680    680    463    414    951
Pentium 4           1900   1067   1043    981    980    979    967   1007
Athlon Tbird  D     1466   2133   1744    755    756    666    666   1217
Pentium 4     D     1800   2133   1952   1455   1455   1401   1415   1641
Athlon Tbird  D     1333   2133   1968    756    756    659    657   1219
Pentium 4     D     3066   2133   2021   1826   1819   1812   1818   1913
Athlon 4      D     1725   2400   2032    888    878    668    745   1172
Athlon XP     D     2080   2667   2336   1171   1167    903    986   1549
Pentium 4     R     1700   3200   2336   1478   1471   1402   1429   1660
P4 Xeon       R     2200   3200   2448   1537   1538   1511   1515   1822
Athlon 64     D     2000   3200   2932   2778   2736   2669   2663   2963
Opteron       D     2000   3200   3136   2123   2129   2070   2110   2476
Pentium 4     R     2533   4267   3216   2078   2100   2075   2084   2358
Atom M        D2    1600   6400   3280   3011   2958   2998   2953   3250
Pentium M     DC    1862   4267   3328   2379   2375   2258   2294   2545
Core 2 Duo a  DC2   2400   8533   3456   4312   4314   4194   4342   4860
Pentium 4     DC    2533   4267   3529   2576   2578   2451   2448   2742
Celeron C2 M  DC2   2000   8533   3632   2550   2843   2607   3351   3493
Turion 64 M   DC2   1900   8533   4112   2513   2555   2430   2484   2689
Core 2 Duo M  DC2   1830  10667   4800   3738   3758   3604   3643   4464
Pentium 4E    DC    3000   6400   4976   3613   3623   3432   3564   3895
Athlon 64     DC    2210   6400   4992   2793   2791   2704   2803   2941
Pentium 4     DC    3678   6272   5021   3375   3381   3249   3273   3723
Core 2 Duo b  DC2   2400   8533   5376   4435   4402   4413   4342   5161
Core 2 Duo c  DC2   2400  12800   6272   5051   5061   4961   4893   5720
Phenom II     DC32  3000  21333   7208   5397   5393   5263   5262   6950
Core i7       DC32  3066  17067  11264   7845   7840   5410   7853   8290
Core i5 2467M DC3   2300  21333  12608  10245   9632   6570   9481  10258
Core i7       DC32  3466  21333  13600   9095   9204   6275   9421   9794
Core i7 4820K QC34  3900  51200  16472  13681  13670  13124  13696  14137

Key
B     L2 cache on memory bus
#     Example of poor results
H     L2 at half CPU MHz or less
R     RDRAM
D     DDR RAM
DC    Dual Channel DDR RAM
DC2   DDR 2
DC32  DDR 3 2 Channel
M     Mobile CPU
QC34  DDR 3 4 Channel



RandMem Benchmark

RandMem benchmark carries out eight tests at increasing data sizes to produce data transfer rates in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests for 32 bit integers and 64 bit floating point numbers. The C/C++ program structure is as follows with array xi indexing via sequential or random numbers stored in the array.

Read - toti = toti & xi[xi[i+0]] | xi[xi[i+2]] & xi[xi[i+4]] |& to i+30
Read/write - xi[xi[i+2]] = xi[xi[i+0]]; repeated to i+30 and i+28     

The main purpose is to demonstrate performance differences between sequential and random access when using the same CPU instructions, particularly the impact of burst reading (and writing) over a bus. In this case, with random access, 32 bytes or more will be read when only four are requested (see also BusSpeed Benchmark). Random speeds are also affected by lower level cache sizes.

A precompiled version of the benchmark can be found in randmem.zip, which also contains the source code, providing further explanatory comments. Information on maximum speeds when different processing is involved can be obtained from BusSpeed Benchmark Results and SSEfpu Benchmark Results. Further details and comparisons, including those for multithreaded benchmark versions, are in randmem results.htm.

Below is an example of MB/second results from a Core i7 CPU, showing the effects of different cache sizes. Note the decrease in random access speeds as the data size increases, due to burst reading and the reducing benefit of caching.

        Core i7 4820K mainly running at 3.9 GHz using Turbo Boost
        32 GB 1600 MHz RAM over 4 channels, Windows 8.1
 
         Integer.......................  Double/Integer................
         Serial........  Random........  Serial........  Random........
    RAM   Read   Rd/Wrt   Read   Rd/Wrt   Read   Rd/Wrt   Read   Rd/Wrt
     KB  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec

L1    6   24753   21240   24353   20950   27914   26690   27901   26866
     12   24674   21377   24041   20986   28277   24369   28276   27232
     24   24599   21373   24361   21586   28457   24246   28440   25932
L2   48   22414   20560   18133   12948   28389   24984   28045   22632
     96   22465   20538   13834    8952   28354   24827   22114   13686
    192   22480   20579   11814    7779   28353   24880   18659   12085
L3  384   21765   17461    7988    5917   26567   21036   14434    9949
    768   21847   17211    6070    5018   26933   19937   10299    7930
   1536   21853   17168    5439    4604   26452   20292    8886    7261
   3072   21456   16651    3263    3165   26243   20120    8286    6868
   6144   21383   16613    1607    1575   26209   20114    3338    3184
R 12288   13559   10997    1165    1137   18529   14306    2042    1965
  24576   12429   10285     926     858   16547   12810    1575    1468
  49152   12596   10358     758     702   16559   12756    1283    1192
  98304   12572   10351     603     572   16509   12777    1059    1012
 196608   12599   10363     510     492   16422   12752     834     818
 393216   12573   10368     468     454   16403   12771     733     728
 786432   12565   10383     442     429   16512   12775     687     685

R = RAM
Maximum speed 800 MHz x 2 DDR x 8 bus width x 4 channels = 51.2 GB/sec
Multiple cores need to be used for a higher throughput from RAM 
    



Windows RandMem Results

Separate tables of speeds obtained via L1 cache, L2 cache and RAM are given below. Except when connected via the memory bus, performance via caches tends to be proportional to CPU MHz for a given type of processor, so only a sample of results is provided. Details of cache sizes, speed and range of CPU MHz can be found in PC CPU Specifications 1994 to 2014, plus Measured MIPS and MFLOPS per MHz.pdf.


RandMem L1 Cache Results in MBytes/Second

                       Integer....................  Double/Integer.............
                       Serial.......  Random......  Serial.......  Random......
CPU                MHz   Read Rd/Wrt   Read Rd/Wrt   Read Rd/Wrt   Read Rd/Wrt

80486 DX2           66     63     80     69     87     47     65     51     80
Pentium            100    205    243    200    233    248    301    258    281
Pentium MMX        200    439    525    434    510    565    669    564    634
Pentium Pro        200    654    308    654    470    760    662    794    681
Pentium II         450   1471   1072   1508   1077   1745   1530   1801   1495
Celeron A          450   1496   1084   1511   1084   1757   1508   1761   1485
Pentium III        450   1500   1066   1482   1034   1702   1472   1719   1499
AMD K62            500   1114   1434   1131   1356    790   1575    841   1510
Celeron 2          566   1900   1375   1908   1357   2276   1928   2263   1882
Duron              700   1582   1730   1615   1727   2819   2320   2575   2253
Pentium IIIE       733   2460   1772   2462   1751   2909   2491   2928   2437
Athlon             800   1843   2025   1918   2017   2031   2401   1893   2444
Athlon Tbird      1000   2310   2514   2360   2471   4038   3310   3687   3256
Celeron M         1295   4620   3199   4511   3152   6404   4359   6666   4383
Atom M            1600   2639   3215   2722   3213   3398   3786   3437   3838
Pentium 4         1800   6361   3421   6559   2378   6139   5687   6138   3021
Ath4 Barton       1800   4068   4290   4077   4438   7377   5960   6654   5843
Core 2 Duo M      1830   4317   7669   6611   5123   8875   9348   9316   8444
Pentium M         1862   6586   4612   6701   4584   9793   6304   9771   6288
Pentium 4         1900   6553   3667   6788   2511   6361   6188   6443   3192
Turion 64 M       1900   4691   5222   4776   4965   7891   6653   7569   6660
Opteron           2000   4514   4909   4532   4922   8063   6609   7421   6464
Celeron C2 M      2000   6884   7227   7095   5034  10163  10333   6852   7987
Athlon XP         2080   4728   5215   4755   5158   8268   6830   7618   6800
Athlon 64         2210   5554   6072   5532   6129   9772   7799   9165   7724
Core i5 2467M     2300   7800   7822   8834   7978  10059   9427  10114  10698
Core 2 Duo 1 CP   2400   8821   9518   8806   7379  12415  12690  12405  12464
Pentium 4E HT     3000   9620   5664   9840   3460   8015   7874   8894   4655
Pentium 4         3000  10397   5781  10768   3830  10230   9448  10255   4938
Core i7           3060  10809  11713  10802  12145  14813  14343  14405  15544
Phenom II         3000  12252   8269  11570   8222  15567  10000  15514  10664
Core i7           3460  12122   7425  12505   6818  16279   9503  16598  10807
Pentium 4         3678  12630   7668  13268   4703  12561  11942  12478   6096
Core i7 4820K     3900  24674  21377  24041  20986  28277  24369  28276  27232

MIPS multiply by         0.55   0.37   0.55   0.37   0.28   0.31   0.28   0.31


RandMem L2 and L3 Cache Results in MBytes/Second

                       Integer....................  Double/Integer.............
                       Serial.......  Random......  Serial.......  Random......
CPU                MHz   Read Rd/Wrt   Read Rd/Wrt   Read Rd/Wrt   Read Rd/Wrt

80486 DX2           66     23     17     11     12     22     15     12     14
Pentium            100     96     73     32     40     89     69     35     46
Pentium MMX        200    195    135     86     93    183    132     94    110
Pentium Pro        200    487    269    208    132    613    310    357    207
Pentium II         450    700    325    313    136    559    398    323    177
Celeron A          450    994    769    287    233    912    813    319    309
Pentium III        450    801    335    303    141    794    406    526    230
AMD K62            500    400    182     72     55    425    221    111     77
Celeron 2          566   1505   1222    373    356   1186   1376    388    426
Duron              700   1143   1073    678    718   1210   1313   1203   1390
Pentium IIIE       733   2060   1674   1531    993   2593   1827   2275   1479
Athlon             800    840    576    610    320   1169    864   1193   1048
Athlon Tbird      1000   1636   1543    976   1028   2520   1906   2454   2066
Celeron M         1295   3386   2678   1930   1009   4447   3183   3115   1649
Atom M            1600   2160   2306    718    944   2775   2584   1208   1455
Pentium 4         1800   4143   2129   2621   1901   6541   5023   4903   2313
Ath4 Barton       1800   2968   2819   1571   1814   4525   3220   4378   3733
Core 2 Duo M      1830   5793   6735   3061   2717   7520   6418   5412   4198
Pentium M         1862   4833   4132   2807   1458   6733   4965   4393   2371
Pentium 4         1900   5115   2215   2745   1965   6786   3036   4713   2437
Turion 64 M       1900   2804   2671   2486   2393   4426   3994   4797   4140
Opteron           2000   3128   3198   2881   2731   5222   3671   5249   4402
Celeron C2 M      2000   6213   7155   3319   3006   8788   7702   6050   4428
Athlon XP         2080   3458   3311   2054   2112   5232   3931   5083   4419
Athlon 64         2210   4070   3734   3322   3257   6140   4420   6124   5218
Core i5 2467M     2300   8593   7538   5300   3390  11588   8536   7796   5175
Core 2 Duo 1 CP   2400   7752   8989   4112   3655  10739   9632   7335   5771
Pentium 4E HT     3000   6892   3073   3482   2541   7855   4821   6899   3250
Pentium 4         3000   8104   3238   4291   3117   9936   6036   8324   3856
Core i7           3060  10156  10801   5895   5623  13359  12881   9894   9110
Phenom II         3000  10549   7860   6381   5215  15308   9662  14830   9879
Core i7           3460  11111   6666   5911   5429  13574   8977  10187   8073
Pentium 4         3678   9894   4533   5166   3785  12423   9155   9174   4396
Core i7 4820K     3900  22465  20538  13834   8952  28354  24827  22114  13686

L3 Cache at 3072 KB, i5 1536 KB

Phenom II         3000   7874   6680   1077   1017   9428   8358   2048   2045
Core i5 2467M     2300   7064   5632   2243   1904  10357   7834   3927   2977
Core i7           3060   9718   9846   2364   2312  12661  11345   5207   4408
Core i7           3460   9762   6331   2378   2620  14411   9396   5608   4601
Core i7 4820K     3900  21853  17168   5439   4604  26452  20292   8886   7261


RandMem RAM Speed Results in MBytes/Second at 6.1 MB

Random access speeds reduce as the memory capacity used increases, so a standard size of 6.1 MB was selected to provide appropriate comparisons.

                             Integer....................  Double/Integer.............
                             Serial.......  Random......  Serial.......  Random......
CPU              RAM     MHz   Read Rd/Wrt   Read Rd/Wrt   Read Rd/Wrt   Read Rd/Wrt

80486 DX2                 66     21     10      6      7     17     10      8      9
Pentium                  100     55     35     11     14     52     40     19     22
Pentium MMX      P0      200    121     83     23     26    111     78     31     36
Pentium Pro      P0      200    129     75     32     21    142     84     53     16
AMD K62          P1      500    132     87     18     15    130     94     26     20
Celeron A        P0      300    232    115     62     41    138    151     77     57
Duron            P2      700    247    193     36     30    393    343     49     46
Athlon Tbird     P2     1000    249    207     38     33    488    358     62     53
Celeron 2        P0      566    276    167     80     53    253    192     93     70
Athlon           P2      800    250    191     38     33    371    323     54     51
Pentium II       P1      450    300    152     89     61    196    194    107     79
Pentium III      P1      450    329    167     98     68    350    233    169    112
Ath4 Barton      D1     1800    383    265     69     48    559    343    115     77
Athlon 4         D1     1667    453    426    129     93    699    573    222    149
Pentium IIIEB    P2     1000    469    257    142     97    513    344    215    156
Pentium IIIEB    P2      733    474    204     96     66    391    251    123     88
Athlon XP        D2     2080    884    727    183    116   1224    880    311    187
Pentium 4        P2     1900    940    387     48     42    914    483     76     64
Celeron M               1295   1029    456     89     55   1467    632    144     93
Pentium 4        R1     1400   1324    689    107     84   1123    912    159    118
Pentium 4        D1     1800   1394    630     98     80   1658    803    168    123
Opteron          D3     2000   1536   1377    121    111   2297   1822    235    217
Pentium 4        D1     2533   1561    599     75     58   1623    738    118     90
Pentium 4        D1     3066   1737    655     70     51   1718    785    125     81
Turion 64 M      DC3    1900   1758   1392    247    191   2222   1704    430    304
Pentium 4        R2     2533   1968   1019    172    145   2919   1352    297    220
Atom M           DD2    1600   2058   1072     52     81   2283   1392     84    127
Pentium M        DC1    1862   2073    787    340    213   2442   1238    616    376
Athlon 64        D3     1995   2100    965    156    122   2520   1432    291    225
Athlon 64        DC2    2210   2145   1451    248    159   3008   1785    402    254
Pentium 4        DC1    2533   2335    847     98     72   2303    978    166    114
Celeron C2 M     DC3m   2000   3000   1212    302    183   3027   1455    514    311
Pentium 4        DC2    3678   3150   1850    181    124   4115   2103    294    196
Core 2 Duo M     DC3M   1830   3384   1524    459    296   3349   1864    849    534
Pentium 4E HT    DC2    3000   3523   1736    182    141   3569   2092    325    224
Core 2 Duo 1CP   DC3b   2400   4854   2605    789    597   5532   3799   1486   1309
Core 2 Duo 1CP   DC3a   2400   4947    770    349    208   1685   1052    932    557
Core 2 Duo 1CP   DC3c   2400   5136   2775    878    657   6086   4041   1637   1396
Phenom II $C     DC33   3000   6120   6079    747    654   9065   7991   1395   1220
Core i5 2467M    DC33   2300   6127   5396    484    458   7722   6141    825    786
Core i7 $C       DC32   3060   7261   5273    953    854   7008   5650   1665   1483
Core i7 $C       DC33   3460   7811   5110   1071    870   8036   5998   1652   1742
Core i7 $C       QC34   3900  21383  16613   1607   1575  26209  20114   3338   3184
Core i7 12.3MB   QC34   3900  13559  10997   1165   1137  18529  14306   2042   1965

Maximum                       13559  10997   1165   1137  18529  14306   2042   1965

Key
 P0    = 66 MHz                         P1    = 100 MHz
 P2    = 133 MHz                        D1    = DDR 133 MHz
 D2    = DDR 166 MHz                    D3    = DDR 200 MHz
 DC1   = Dual Channel DDR 133 MHz       DC2   = Dual Channel DDR 200 MHz
 DC3a  = DDR2 533 MHz nForce 570 chipset
 DC3b  = DDR2 533 MHz Intel 965 chipset
 DC3c  = DDR2 800 MHz Intel 965 chipset
 DC3M  = DDR2 666 MHz Mobile CPU        DC3m  = DDR2 533 MHz Mobile CPU
 DC33  = DDR3 1333 MHz                  DC32  = DDR3 1066 MHz
 QC34  = DDR3 1600 MHz 4 Channels       R1/R2 = RDRAM 400/533 MHz
 $C    = 6.1 MB Mainly or all L3 cache



SSEfpu Benchmark

SSE3DNow is a Windows benchmark that carries out similar calculations to MemSpeed, but uses floating point Single Instruction Multiple Data (SIMD) functions, via assembly code instructions, plus some tests using normal C/C++ compilations. The benchmark and source code are available in sse3dnow.zip, and further details and results are in sse3dnow results.htm. A 64 bit version is also available in more64bit.zip.

3DNow functions are only available on AMD CPUs, using MMX registers. SSE deals with four single precision numbers in 128 bit registers, which SSE2 also uses for two numbers at double precision. Results are given as Millions of Bytes Per Second (MB/s) memory reading speed. On modern systems, this tends to be the same for SSE and SSE2 calculations, so SSE, with twice as many numbers per register, executes calculations at twice the rate of SSE2.

Following is an example of logged results on a 2014 Core i7 CPU. This also shows the conversion factors for MB/second to MFLOPS. Using SSE, this processor is capable of producing four single precision results per clock cycle, at 15.6 GFLOPS, or eight per cycle, at 31.2 GFLOPS, with linked add and multiply operations. The measured maximum here was not that good, at 10.375 GFLOPS, and more register based calculations within a loop would be needed to improve the score. On the other hand, measured performance is around four times faster than MemSpeed.


    Core i7 4820K mainly running at 3.9 GHz using Turbo Boost
    32 GB 1600 MHz RAM over 4 channels, Windows 8.1
 
  Memory          s=s+x[m]*y[m]               x[m]=x[m]+y[m]
  KBytes    SSE2    SSE  3DNow   Sngl   SSE2    SSE  3DNow   Sngl
   Used     MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

L1    4    40800  40863      0  19658  80200  80443      0  12830
      8    41176  41220      0  19265  78408  79958      0  12458
     16    41395  41403      0  19262  80913  81351      0  12372
     32    41457  41496      0  19097  81148  81959      0  12328
L2   64    41384  41355      0  17867  46570  46650      0  12088
    128    41456  41498      0  17722  49062  49192      0  12148
    256    41198  41249      0  17749  47763  48167      0  12074
L3  512    35492  35350      0  17968  30461  30443      0  11957
   1024    35503  35502      0  17985  30385  30418      0  11945
   2048    35809  35807      0  18098  30709  30392      0  12006
   4096    35048  35043      0  17588  30166  30182      0  11798
   8192    31740  32291      0  16997  27356  27306      0  11635
R 16384    20609  20109      0  13800  15237  15076      0   9988
  32768    19916  19732      0  13448  14515  14535      0   9866
  65536    19943  19678      0  13694  14531  14720      0   9889
 131072    19962  19652      0  13433  14471  14497      0  10016
 262144    19765  19716      0  13453  14696  14541      0   9917
 524288    19809  19765      0  13479  14728  14542      0   9997
1048576    20093  19736      0  13588  14515  14688      0   9913

R = RAM

 Divide       DP     SP     SP     SP     DP     SP     SP     SP
 Maximum 
 MB/S by       8      4             4     16      8             8    
 MAX MFLOPS 5182  10375      0   4915   5072  10245      0   1604
    



Windows SSEfpu Results

Separate tables of speeds obtained via L1 cache, L2 cache and RAM follow. The results include some for SSE64, the 64 bit version, in more64bit.zip. The i387 normal floating point and 3DNow tests are not included in that version, as the instructions are not supported at 64 bits. The SSE/SSE2 assembly code is the same as in the original SSE3DNow, leading to no apparent difference in performance.

SSEfpu L1 Cache Results in MBytes/Second

                         |---- s=s+x[m]*y[m] -----|  |---- x[m]=x[m]+y[m] ----|
CPU                 MHz   SSE2    SSE  3DNow   Sngl   SSE2    SSE  3DNow   Sngl

80486 DX2            66      0      0      0     21      0      0      0     21
Pentium             200      0      0      0    259      0      0      0    143
Pentium MMX         200      0      0      0    314      0      0      0    178
Pentium Pro         200      0      0      0    523      0      0      0    273
AMD K62+            500      0      0   1939    572      0      0   3325    344
Celeron A           450      0      0      0   1174      0      0      0    609
Pentium II          450      0      0      0   1197      0      0      0    636
Pentium III         550      0   3511      0   1451      0   3341      0    768
Pentium IIIEB       800      0   5029      0   2076      0   4792      0   1047
Atom M             1600   3015   7232      0   2099   6189  12168      0   1155
Pentium 4          1900   9285   9338      0   2583   9331   9350      0   1237
Duron               750      0      0   5001   2601      0      0   6200   1330
P4 Xeon            2200  10738  10729      0   2805  11499  11505      0   1392
Pentium 4          2533  12389  12440      0   3223  12922  12853      0   1640
Pentium 4          3066  15116  15035      0   3912  15397  15288      0   2099
Athlon Tbird       1200      0      0   7865   4082      0      0   9721   2084
Celeron M          1295   9257   9891      0   4675  10025   9634      0   2006
Pentium 4          3678  18107  18084      0   4688  19201  19269      0   2391
Athlon XP          1400      0  10671   9344   4866      0  10262  11641   2443
Pentium 4E         3000  17765  17511      0   5758  20830  20605      0   3197
Core 2 Duo M       1830  18704  18482      0   6074  28235  28306      0   3429
Athlon XP          1733      0  13466  11998   6090      0  13727  14566   3088
Ath4 Barton        1800      0  13470  11916   6245      0  13975  14791   3200
Turion 64 M        1900  13697  14236  11785   6347  14693  14563  11420   3317
Pentium M          1862  13363  14338      0   6730  14463  14460      0   2893
Celeron C2 M       2000  19513  19481      0   6877  30564  29729      0   4051
Opteron            1990  15240  15232  13616   6928  15670  15668  12589   3523
Athlon 64          1995  15355  15245  13722   6985  15796  15794  12614   3552
Athlon XP          2080      0  15635  13836   7267      0  16237  17303   3698
Athlon 64 64 bit   2210  15988  15993                16179  17121
Athlon 64          2210  16382  16280  14518   7678  17531  17320  13931   4020
Core 2 Duo 64 bit  2400  25326  25389                37440  37532
Core 2 Duo 1 CP    2400  25371  25340      0   8768  37973  37972      0   4717
Core i5 2467M      2300  23843  23716      0  11390  44512  44790      0   6805
Phenom II 64 bit   3000  22235  21441                34485  45329
Phenom II          3000  23213  23217  19426  11433  45662  45826  25700   5373
Core i7            3060  30731  30730      0  12116  45838  45849      0   6071
Core i7            3460  35467  36212      0  13633  53307  46427      0   4414
Core i7 64 bit     3900  41037  41038                78440  78254
Core i7 4820K      3900  40800  40863      0  19658  80200  80443      0  12830

Maximum                  41037  41038  19426  19658  80200  80443  25700  12830
Maximum MFLOPS            5130  10260   4857   4915   5013  10055   3213   1604


SSEfpu L2 and L3 Cache Results in MBytes/Second

                         |---- s=s+x[m]*y[m] -----|  |---- x[m]=x[m]+y[m] ----|
CPU                 MHz   SSE2    SSE  3DNow   Sngl   SSE2    SSE  3DNow   Sngl

80486 DX2            66      0      0      0     13      0      0      0     11
Pentium             200      0      0      0    138      0      0      0     98
Pentium MMX         200      0      0      0    178      0      0      0    118
Pentium Pro         200      0      0      0    481      0      0      0    206
AMD K62+            500      0      0   1180    467      0      0   1342    300
Celeron A           450      0      0      0    806      0      0      0    574
Pentium II          450      0      0      0    711      0      0      0    386
Pentium III         550      0   2338      0   1235      0   1830      0    739
Pentium IIIEB       800      0   3186      0   1874      0   2614      0   1016
Atom M             1600   2544   4303      0   1800   4441   4805      0   1053
Pentium 4          1900   9362   9149      0   2418   7206   7053      0    959
Duron               750      0      0   1916   1169      0      0   1579    730
P4 Xeon            2200  10168  10209      0   2624   7976   7991      0   1529
Pentium 4          2533  11791  11829      0   3086   9406   9383      0   1784
Pentium 4          3066  14187  13063      0   3624  11031  11020      0   2117
Athlon Tbird       1200      0      0   3258   2024      0      0   2530   1134
Celeron M          1295   5661   5686      0   3635   4319   4307      0   1641
Pentium 4          3678  16454  17103      0   4324  13092  13213      0   2642
Athlon XP          1400      0   2330   2474   1804      0   2403   2785   1320
Pentium 4E         3000  16562  16632      0   4839  13078  13045      0   3024
Core 2 Duo M       1830  12608  12777      0   5944  12157  11768      0   3067
Athlon XP          1733      0   4600   4671   2640      0   3476   3461   1708
Ath4 Barton        1800      0   4673   4779   2615      0   3537   3517   1754
Turion 64 M        1900   5267   5547   4876   3304   3239   3124   3270   1779
Pentium M          1862   8225   8222      0   5282   6027   5952      0   2435
Celeron C2 M       2000  14008  13687      0   6925  13565  13565      0   3098
Opteron            1990   7223   7409   6686   3819   4036   4038   4286   1953
Athlon 64          1995   7200   7467   6707   3849   3736   3742   4068   1891
Athlon XP          2080      0   5577   5828   3276      0   4227   4184   2065
Athlon 64          2210   7822   7802   7522   4186   4792   4939   4851   2226
Athlon 64 64 bit   2210   8471   7930                 5940   5916
Core 2 Duo 1 CP    2400  16839  17072      0   8334  16389  15920      0   4111
Core 2 Duo 64 bit  2400  18281  18536                17041  17065
Core i5 2467M      2300  23974  23578      0  11131  27176  28358      0   6605
Phenom II          3000  23022  23039  16241  11246  17541  17394  14997   4965
Phenom II 64 bit   3000  23409  23357                18237  18163
Core i7            3060  27558  27566      0  11019  25457  25367      0   5509
Core i7            3460  31812  28222      0  12454  28270  29044      0   4818
Core i7 64 bit     3900  41560  41640                50611  50462
Core i7 4820K      3900  41456  41498      0  17722  49062  49192      0  12148

Maximum                  41560  41640  16241  17722  50611  50462  14997  12148
Maximum MFLOPS            5195  10410   4060   4430   3163   6308   1875   1518

L3 Cache

Core i5 2467M      2300  21091  20073      0  10959  18897  18205      0   6606
Phenom II          3000  10205  10746   8609   7401   9510   9535   8595   4673
Core i7            3060  22999  23037      0  10800  18395  18390      0   5472
Core i7            3460  25824  23662      0  12348  20689  21053      0   4525
Core i7 64 bit     3900  36080  36324                31720  31725
Core i7 4820K      3900  35048  35043      0  17588  30166  30182      0  11798


SSEfpu RAM Speed Results in MBytes/Second

                               |---- s=s+x[m]*y[m] -----|  |---- x[m]=x[m]+y[m] ----|
CPU               RAM     MHz   SSE2    SSE  3DNow   Sngl   SSE2    SSE  3DNow   Sngl

80486 DX2                  66      0      0      0     13      0      0      0      9
Pentium                   200      0      0      0     77      0      0      0     60
Pentium MMX               200      0      0      0    110      0      0      0     84
Pentium Pro               200      0      0      0    128      0      0      0     82
AMD K62+                  500      0      0    196    175      0      0    135    126
Athlon Tbird      P2     1200      0      0    417    215      0      0    340    185
Duron             P2      750      0      0    525    284      0      0    379    215
Pentium II        P1      450      0      0      0    291      0      0      0    166
Celeron A         P1      450      0      0      0    335      0      0      0    178
Pentium III       P1      550      0    670      0    359      0    297      0    175
Pentium IIIEB     P2      800      0    836      0    402      0    421      0    257
Ath4 Barton       #D1    1800      0    592    582    553      0    399    385    395
Core 2 Duo A      #DC3   2040    717    717      0    811    633    633      0    644
Athlon XP         D1     1400      0   1118   1040    886      0    730    677    637
Pentium 4         P2     1900    974    975      0    945    594    595      0    588
Athlon XP         D1     1733      0   1371   1381   1116      0   1061    921    928
Athlon XP         D2     2080      0   1391   1364   1212      0    999    923    851
Athlon XP         D2     2170      0   1631   1625   1338      0   1240   1093   1121
Celeron M                1295   1504   1517      0   1345    822    832      0    758
Pentium 4         D1     2533   1415   1386      0   1401    847    841      0    840
Atom M            DD2    1600   2303   2854      0   1660   1697   1762      0    999
Pentium 4         D1     2533   1852   1843      0   1750    626    628      0    616
Pentium 4         D2     3066   1883   1878      0   1802   1034   1034      0   1000
P4 Xeon           R1     2200   2427   2426      0   1968   1240   1240      0   1087
Pentium 4         DC1    2533   2187   2180      0   2018   1286   1280      0   1169
Opteron           D3     1990   2601   2605   2548   2322   2061   2044   2112   1567
Pentium 4         R2     2533   3504   3494      0   2480   1743   1746      0   1440
Turion 64 M       DC3    1900   2862   2858   2805   2419   2052   2118   2140   1507
Pentium M         DC1    1862   2399   2331      0   2491   1380   1381      0   1340
Athlon 64         D3     1995   2688   2711   2656   2564   1564   1558   1559   1478
Athlon 64 64b     DC2    2210   3166   3193                 2044   2043
Athlon 64         DC2    2210   3325   3329   3339   2935   2074   2080   2071   1804
Celeron C2 M      DC3    2000   3096   3100      0   3080   1926   1726      0   1901
Pentium 4E        DC2    3000   3639   3672      0   3240   2383   2380      0   2259
Pentium 4         DC2    3678   4408   4389      0   3498   2648   2639      0   2102
Core 2 Duo M      DC4    1830   4388   4425      0   4144   2422   2521      0   2320
Phenom 64 bit     DC7    3000   6329   6279                 4680   4460
Phenom II         DC7    3000   6511   6538   5083   4308   4700   4773   4576   3572
Core 2 Duo B      DC3    2400   4904   4895      0   4752   3131   3134      0   3015
Core 2 64 bit     DC5    2400   5628   5630                 3617   3713
Core 2 Duo C      DC5    2400   5777   5749      0   5157   3866   3823      0   3371
Core i7           DC6    3060   9196   9191      0   6561   7035   7049      0   4454
Core i7           DC7    3460  12467  10690      0   7739   8357   8264      0   4486
Core i5 2467M     DC7    2300  14105  12928      0   8969  10917  10781      0   5933
Core i7 64 bit    DC7    3900  19410  19406                13895  13472
Core i7 4820K     QC8    3900  20093  19736      0  13588  14515  14688      0   9913

Maximum                        20093  19736   5083  13588  14515  14688   4576   9913
Maximum MFLOPS                  2512   4934   1271   3397    907   1836    572   1239

Key
 P1  = 100 MHz                      P2  = 133 MHz
 D1  = DDR 133 MHz                  D2  = DDR 166 MHz
 D3  = DDR 200 MHz                  DC1 = Dual Channel DDR 133 MHz
 DC2 = Dual Channel DDR 200 MHz     DC3 = DDR2 533 MHz
 DC4 = DDR2 666 MHz                 DC5 = DDR2 800 MHz
 DC6 = DDR3 1066 MHz                DC7 = DDR3 1333 MHz
 QC8 = DDR3 1600 MHz 4 Channels     R1  = RDRAM 400 MHz
 R2  = RDRAM 533 MHz                #   = Slow speed example
 64b = 64 bit compilation
 C2D A # = nForce 570 chipset       C2D B/C = Intel 965 Chipset

FFT Benchmarks

The FFT benchmarks started life in early 2000, based on a program from Scott Taylor of DSP Systems Inc. The Windows versions were titled FFTGraf. Three of them were produced, each providing graphical output. The first was all optimised C code, the second was further optimised, including assembly language, and the third had SSE SIMD assembly code and further tuning changes. Further details can be found in fftgraf results.htm. The benchmarks and source code can be downloaded from fftgraf.zip.

The benchmarks run code for single and double precision Fast Fourier Transforms of size 1024 to 1048576 (1K to 1024K), each one being run a number of times to identify variance. Besides the graph, results are displayed and saved in a log file, with FFT running time in milliseconds. An example of results is shown below. As shown, some checks of numeric calculations are carried out on the largest FFTs. These are subject to variation due to different rounding effects.

The latest versions are all C code, with text output only. FFT1 is the original and FFT3c is the third version, with rearranged C statements instead of assembly code. These comprise 32 bit and 64 bit versions to run via Windows, Linux and Android. Further details and results are in FFTBenchmarks.htm. An example of the latest 64 bit benchmark is also provided below, for the same Core i7, via Windows. Note the similar performance to FFTGraf and the different sumchecks.


FFTGraf Example Log File Core i7 4820K mainly running at 3.9 GHz 

     FFTGraf Test  Version 3.00 Sun Sep 24 17:08:35 2017
 
 By Roy Longbottom via Scott Taylor's code and now SSE, SSE2, 3DNow

  Windows NT Version 6.2, build 9200, 
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4, 3711 MHz
  From GlobalMemoryStatus: Size 2097151 KB, Free 2097151 KB
  5 Passes Min 1 K Max 1024 K Max Seconds 15

 Size  Single Precision FFTs using SSE
    K  Millisecond each pass
    1  0.015  0.011  0.011  0.011  0.011
    2  0.023  0.023  0.023  0.023  0.023
    4  0.050  0.050  0.050  0.050  0.050
    8  0.125  0.125  0.140  0.125  0.126
   16  0.323  0.309  0.316  0.308  0.308
   32  0.707  0.687  0.689  0.698  0.688
   64   1.55   1.48   1.48   1.48   1.48
  128   3.29   3.20   3.21   3.20   3.20
  256   7.30   6.97   6.97   6.96   6.96
  512   16.4   16.1   16.1   16.3   16.1
 1024   39.3   38.4   38.7   38.7   38.4

 Size  Double Precision FFTs using SSE2
    K  Millisecond each pass
    1  0.014  0.014  0.014  0.014  0.013
    2  0.030  0.030  0.030  0.030  0.030
    4  0.071  0.071  0.071  0.071  0.071
    8  0.154  0.155  0.153  0.156  0.154
   16  0.403  0.400  0.410  0.400  0.400
   32  0.888  0.869  0.868  0.885  0.869
   64   1.87   1.85   1.88   1.86   1.85
  128   3.98   3.95   3.96   3.96   3.95
  256   9.02   8.93   8.90   8.87   8.87
  512   22.7   22.6   22.9   22.7   22.6
 1024   57.0   55.5   55.5   55.4   55.5

 Checks SP  9.999890e-001  3.338029e-006  1.043487e-011
 Checks DP  1.000000e+000  1.133294e-023  1.428096e-028

    End FFT Test Sun Sep 24 17:08:38 2017


   FFT 64 Bit Benchmark Version 3c.0 Mon Sep 25 10:54:58 2017

  Size                     milliseconds
    K     Single Precision              Double Precision
    1     0.019     0.013     0.012     0.013     0.012     0.012
    2     0.029     0.026     0.026     0.028     0.028     0.027
    4     0.063     0.059     0.059     0.071     0.070     0.070
    8     0.153     0.143     0.142     0.165     0.164     0.164
   16     0.364     0.335     0.334     0.385     0.384     0.384
   32     0.750     0.725     0.735     0.814     0.814     0.815
   64     1.670     1.555     1.556     1.751     1.765     1.750
  128     3.528     3.375     3.386     3.778     3.748     3.748
  256     7.663     7.331     7.342     8.772     8.898     8.727
  512    17.683    17.250    17.229    23.083    22.583    22.768
 1024    43.607    42.397    42.613    58.202    56.390    55.824

        1024 Square Check Maximum Noise Average Noise
        SP  9.999520e-001  3.346482e-006  4.565234e-011
        DP  1.000000e+000  1.133294e-023  1.428110e-028



Windows FFT Results

Below is an example of the graph produced by FFTGraf, running on a 3900 MHz Core i7. It is automatically scaled at run time and is based on milliseconds per K FFT size to reduce the range, as opposed to milliseconds, where the slowest can be thousands of times greater than the fastest. The graph also indicates the range of memory address space used.

Following this is a table showing the number of floating point operations used at each FFT size and the calculations of MFLOPS for the three versions of FFTGraf (FP op count/1000/milliseconds from the later tables). The original version comprised compiled C code. Analysis of the data flow identified that access was largely dependent on skipped sequential addresses. The second version included assembly code for the critical calculations, data loading in segments into L2 cache and optimised use of burst reading of data. The third version made use of SSE or 3DNow SIMD instructions. As seen in the table, performance of the larger FFTs could be increased by more than three times. Gains of more than five times were noted on earlier PCs.

After the above are keys for cache and RAM sizes used on systems identified in the later detailed tables.

Graph

      3900 MHz Core i7 MFLOPS - from original FFTGraf results

                         FFT1          FFT2          FFT3
  FFT size  FP op count     SP     DP     SP     DP     SP     DP

      1024        53312   2539   2318   3332   3136   5331   4101
      2048       116864   2486   2164   3437   3075   5312   3895
      4096       254080   2310   1815   3630   2823   5082   3630
      8192       549120   1961   1771   3230   2890   4576   3661
     16384      1179904   1967   1686   2950   2510   4069   2950
     32768      2523648   1682   1262   2804   2524   3605   2804
     65536      5374464   1311   1221   2829   2443   3839   2829
    131072     11404288   1267   1267   2782   2479   3801   2851
    262144     24118272   1269   1049   2680   2412   3445   2680
    524288     50857984   1060    892   2543   1956   3391   2312
   1048576    106956800    947    557   2183   1725   2891   1945

                    Cache & RAM Key

 L1 and L2 cache size e.g. 16 = 8 KB L1 and 256 KB L2

 1 =   8 KB  2 =  16 KB   3 = 32 KB   4 = 64 KB  5 = 128 KB
 6 = 256 KB  7 = 512 KB   8 =  1 MB   9 =  2 MB  A =   4 MB
 H =  24 KB  
 Z = 512 KB + 6 MB      X = 256 KB + 8 MB   W = 256 KB + 3 MB, V = 256 KB + 10 MB
 B = L2 on memory bus   F = At CPU MHz      H = Half CPU MHz   

Bus/Memory Speed

 Numbers 33, 50, 66, 100, 133 = MHz   

 DD1 = DDR at 133 MHz    DC1 = Dual Channel DDR at 133 MHz
 DD2 = DDR at 166 MHz    DC2 = Dual Channel DDR at 166 MHz
 DD3 = DDR at 200 MHz    DC3 = Dual Channel DDR at 200 MHz
 RD2 = RDRAM  400 MHz    RD1 = One  Channel RDRAM  400 MHz
 RD3 = RDRAM  533 MHz    DC4 = DDR2   533 MHz    DC5 = DDR2  666 MHz
 DC6 = DDR2  800 MHz     DC7 = DDR3  1066 MHz    DC8 = DDR3 1333 MHz    
 QC9 = DDR3 1600 MHz 4 channel     SCC = DDR2 533 MHz single channel
 # = Particularly slow memory   S - last column - uses SSE or SSE2 instructions



FFTGraf Version 1

Single Precision Milliseconds
                     Cache       FFT Size K --->                                              
Processor       MHz  & RAM       1     2     4     8    16    32    64   128   256   512  1024

80486            66 15B  33     17    39    85   196   509  1240  2752  5864 12427            
Pentium         100 16B  50    3.0   9.7    22    54   127   307   801  1790  3844            
Pentium MMX     200 27B  66    1.2   3.1    11    24    52   119   277   807  1806  3844      
Pentium Pro     200 16F  66    1.1   2.9   6.4    14    37   101   358   797  1740  3717      
Celeron A       400 25F  66   0.36  0.85   2.6   7.5    36   106   254   569  1188  2543  5356
Pentium II      450 27H 100   0.32  0.86   4.1   9.2    20    47   132   395   985  2257  4627
Pentium IIIE    550 26F 100   0.26  0.60   1.6   3.5    12    34   134   309   684  1461  3313
Pentium IIIEB   733 26F 133   0.19  0.46   1.2   2.6   6.2    27   128   291   626  1377  2876
Pentium IIIEB  1000 26F 133   0.14  0.33  0.82   1.8   4.6    33   122   300   657  1414  3029
Pentium IIIEB  1000 26F RD1   0.14  0.33  0.82   1.8   4.2    16    91   216   478  1029  2126
Pentium 4      1500 16F RD2   0.14  0.33  0.77   1.7   4.3    17    93   235   565  1296  2809
Pentium 4      1900 16F 133   0.11  0.27  0.60   1.4   3.4    18   172   402   907  1985  4214
P4 Xeon        2200 17F RD2  0.093  0.23  0.53   1.2   3.0   7.4    31   194   480  1121  2435
Celeron M      1295 38F      0.089  0.20  0.49   1.4   3.0   6.6    15    75   584  1379  3121
Pentium 4E     3000 28F DC3  0.072  0.15  0.38  0.83   1.8   4.2    10    40   226   494  1043
Pentium 4N     3066 17F DD1  0.067  0.17  0.37  0.84   2.1   5.3    32   268   617  1368  2877
Pentium M2     1862 39F DC1  0.063  0.14  0.34  0.94   2.1   4.5    10    24    78   452  1266
Atom M         1600 H7F SCC   0.53  0.57   1.3   3.0   6.5    15    51   228   506  1095  2241
Core 2 Duo M   1830 39F DC5  0.078  0.19  0.34  0.94   2.2   4.8    11    24    80   318   814
Celeron C2 M   2000 38F DC4  0.053  0.13  0.31  0.86   2.0   4.7    10    54   264   571  1211
Core2 Duo A1CP 2400 3AF DC4  0.043  0.11  0.26  0.72   1.7   3.7   8.2    18    42   134  1404
Core2 Duo B1CP 2400 3AF DC4  0.043  0.11  0.26  0.72   1.7   3.7   8.2    18    42   108   565
Core i5 2467M  2300 3WF DC8  0.036 0.080  0.18  0.48   1.1   2.5   6.7    16    34   111   258
Core i7 930    3060 3XF DC7  0.033 0.076  0.18  0.46   1.0   2.4   6.5    14    31    75   168
Core i7 860    3460 3XF DC8  0.033 0.076  0.18  0.46   1.0   2.4   6.3    14    30    72   171
Core i7 4820K  3900 3VF QC9  0.021 0.047  0.11  0.28   0.6   1.5   4.1     9    19    48   113

AMD K62         350 37B 100    1.1   2.4   6.2    27    65   167   375   903  2012  4336  9219
Duron           700 44F 133   0.17  0.37  0.82   2.4    14    74   170   399  1065  2423  5361
Athlon Tbird   1200 46F 133   0.10  0.21  0.46   1.3   6.1    20   167   401   934  2056  4605
Athlon 4       1725 46F DD1  0.066  0.15  0.32  0.91   4.3    11    82   193   462  1035  2160
Athlon 4 Bart  1800 47F#DD1  0.064  0.14  0.32  0.88   4.1    10    28   361   819  1800  3716
Turion 64 M    1900 47F DC4  0.072  0.16  0.34  0.89   4.0   9.3    23    99   233   556  1226
Athlon XP      2080 46F DD2  0.056  0.12  0.27  0.76   3.5   9.2    74   176   428   967  2014
Athlon 64aa    2210 47F DC3  0.051  0.11  0.25  0.73   3.0   7.4    17   101   227   514  1139
Phenom         3000 4ZF DC8  0.037 0.082  0.19  0.50   1.8   4.4    11    30    66   192   598

Double Precision Milliseconds

80486            66 15B  33     21    46    99   262   677  1493  3131  6595 13489            
Pentium         100 16B  50    4.7    11    29    65   159   415   947  2010  4256            
Pentium MMX     200 27B  66    1.6   5.5    13    27    60   176   393   903  1911  4051      
Pentium Pro     200 16F  66    1.4   3.0   6.7    19    65   190   430   925  1980  4222      
Celeron A       400 25F  66   0.49   1.2   4.6    18    56   134   295   635  1385  2916  6081
Pentium II      450 27H 100   0.45   2.1   4.7    10    23    65   233   528  1161  2381  4935
Pentium IIIE    550 26F 100   0.30  0.79   1.7   3.9    17    82   193   416   848  1819  3742
Pentium IIIEB   733 26F 133   0.23  0.60   1.3   3.0    16    69   160   349   750  1600  3295
Pentium 4      1500 16F RD2   0.19  0.43  0.96   2.5   9.8    48   119   284   650  1392  3080
Pentium IIIEB  1000 26F 133   0.17  0.41   1.0   2.6    16    68   166   360   772  1645  3493
Pentium IIIEB  1000 26F RD1   0.17  0.41  0.91   2.1   8.7    47   110   240   512  1142  2400
Pentium 4      1900 16F 133   0.15  0.33  0.75   1.9    13    92   213   463  1006  2095  4463
P4 Xeon        2200 17F RD2   0.13  0.29  0.65   1.7   4.1    19   101   247   574  1217  2684
Celeron M      1295 38F       0.11  0.25  0.67   1.5   3.2   7.2    39   296   712  1518  3127
Pentium 4N     3066 17F DD1  0.084  0.19  0.43   1.1   2.8    19   138   314   696  1421  3070
Pentium 4E     3000 28F DC3  0.076  0.20  0.42  0.93   2.1   5.0    22   114   251   524  1144
Pentium M2     1862 39F DC1  0.074  0.17  0.47   1.0   2.2   4.9    12    45   260   625  1361
Atom M         1600 H7F SCC   0.26  0.64   1.4   3.1   6.8    26   118   262   567  1156  2439
Core 2 Duo M   1830 39F DC5  0.069  0.17  0.45   1.0   2.3   5.0    12    41   200   428   871
Celeron C2 M   2000 38F DC4  0.064  0.15  0.42  0.94   2.1   4.6    26   139   301   605  1231
Core2 Duo A1CP 2400 3AF DC4  0.052  0.13  0.35  0.79   1.8   3.9   8.5    20    85   781  1824
Core2 Duo B1CP 2400 3AF DC4  0.052  0.13  0.35  0.79   1.8   3.9   8.5    20    54   293   704
Core i5 2467M  2300 3WF DC8  0.041 0.094  0.24  0.54   1.2   3.3   7.3    17    55   128   281
Core i7 930    3060 3XF DC7  0.040 0.091  0.23  0.52   1.2   3.2   7.1    15    37    86   284
Core i7 860    3460 3XF DC8  0.040 0.092  0.23  0.52   1.2   3.1   6.8    15    36    87   259
Core i7 4820K  3900 3VF QC9  0.023 0.054  0.14  0.31   0.7   2.0   4.4     9    23    57   192

AMD K62         350 37B 100    1.1   3.0    12    24    66   172   501  1141  2448  5082 10275
Duron           700 44F 133   0.20  0.43   1.3   7.6    39    90   205   547  1248  2756  5972
Athlon Tbird   1200 46F 133   0.11  0.23  0.66   3.0    11    89   209   529  1188  2605  5629
Athlon 4       1725 46F DD1  0.074  0.16  0.47   2.1   5.8    47   107   248   545  1146  2464
Athlon 4 Bart  1800 47F#DD1  0.075  0.16  0.46   1.9   4.6    16   186   422   926  1918  4065
Turion 64 M    1900 47F DC4  0.069  0.15  0.44   1.9   4.4    11    50   118   277   614  1366
Athlon XP      2080 46F DD2  0.065  0.14  0.40   1.7   4.7    34    83   211   479  1009  2196
Athlon 64aa    2210 47F DC3  0.058  0.13  0.36   1.4   3.4   8.9    51   119   258   559  1219
Phenom         3000 4ZF DC8  0.042  0.10  0.25  0.90   2.2   5.3    15    33    94   303   740



FFTGraf Version 2

Single Precision Milliseconds
                     Cache       FFT Size K --->
Processor       MHz  & RAM       1     2     4     8    16    32    64   128   256   512  1024

80486            66 15B  33     16    35    82   186   403   858  1870  3948  8451
Pentium         100 16B  50    3.1   7.3    16    36    86   195   431   924  1952
Pentium MMX     200 27B  66    1.4   3.3   8.0    17    38    87   194   423   899  1894
Pentium Pro     200 16F  66   0.67   1.5   3.2   7.0    23    54   119   250   526  1115
Celeron A       400 25F  66   0.30  0.68   1.7   6.4    16    38    84   189   401   850  1789
Pentium IIIE    550 26F 100   0.21  0.47   1.1   2.3   7.1    19    44    95   201   429   940
Pentium IIIEB   660 26F 133   0.17  0.39   0.9   1.9   6.2    17    38    85   188   410   872
Celeron 2       900 25F 100   0.13  0.30  0.82   2.7    12    33    73   166   344   736  1568
PIII Tualatin  1266 27F 133  0.088  0.20  0.45   1.0   2.3   6.1    19    50   117   264   569
Pentium 4      1900 16F 133  0.075  0.18  0.46   1.1   3.3   9.4    27    69   160   353   768
Celeron M      1295 38F      0.073  0.16  0.36  0.84   1.9   4.3    11    34    91   211   484
Pentium 4E     3000 28F DC3  0.061  0.13  0.40   1.0   2.2   4.7    10    25    61   128   297
Pentium 4N     2400 17F RD2  0.060  0.14  0.35  0.78   1.9   5.1    16    48   118   259   575
Pentium 4N     2400 17F 133  0.060  0.14  0.35  0.78   2.0   5.9    20    58   128   283   648
Pentium M2     1862 39F DC1  0.052  0.11  0.25  0.59   1.4   2.9   6.3    16    41   107   245
Pentium 4N     3066 17F DD1  0.045  0.12  0.28  0.62   1.5   4.3    15    46   111   235   524
Atom M         1600 H7F SCC   0.46  0.49   1.1   2.3   5.3    12    28    68   147   324   700
Core 2 Duo M   1830 39F DC5  0.048  0.11  0.25  0.58   1.4   2.9   6.4    15    37    90   198
Celeron C2 M   2000 38F DC4  0.044  0.10  0.23  0.53   1.2   2.7   6.1    17    42    96   216
Core2 Duo A1CP 2400 3AF DC4  0.035 0.080  0.18  0.44   1.0   2.2   4.7    10    27    83   246
Core2 Duo B1CP 2400 3AF DC4  0.036 0.080  0.18  0.44   1.0   2.2   4.8    10    24    60   151
Core2 Duo B1CP 2400 3AF DC6  0.054  0.12  0.19  0.44   1.0   2.2   4.7    11    24    58   140
Core i5 2467M  2300 3WF DC8  0.030 0.061  0.13  0.30  0.70   1.5   3.3   7.3    16    39    84
Core i7 930    3060 3XF DC7  0.026 0.054  0.12  0.27  0.64   1.4   3.0   6.5    14    32    78
Core i7 860    3460 3XF DC8  0.026 0.055  0.12  0.28  0.63   1.4   3.0   6.4    14    32    74
Core i7 4820K  3900 3VF QC9  0.016 0.034  0.07  0.17  0.40   0.9   1.9   4.1     9    20    49

Duron           700 44F 133   0.13  0.26  0.55   1.7   6.4    17    42    96   229   524  1199
Athlon Tbird   1200 46F 133  0.075  0.16  0.33  0.89   3.5    12    36    82   199   465  1089
Athlon 4       1410 46F DD1  0.062  0.13  0.27  0.76   2.2   6.8    18    41   100   217   497
Athlon 4       1794 46F DD3  0.049  0.11  0.22  0.60   1.8   5.3    14    31    75   163   364
Athlon 4 Bart  1800 47F#DD1  0.049  0.10  0.22  0.61   1.6   4.8    22    52   126   277   620
Turion 64 M    1900 47F DC4  0.047  0.10  0.20  0.55   1.5   3.7    11    26    59   132   301
Athlon XP      2080 46F DD2  0.043 0.089  0.19  0.52   1.6   4.9    13    29    71   171   380
Athlon 64aa    2210 47F DC3  0.040 0.086  0.18  0.47   1.2   3.0   9.2    21    47   106   247
Phenom         3000 4ZF DC8  0.026 0.056  0.12  0.30  0.75   1.8   4.5    11    24    57   162

Double Precision Milliseconds

80486            66 15B  33     20    50   113   251   536  1258  2660  5654 11698
Pentium         100 16B  50    4.0   8.6    20    50   121   268   582  1224  2614
Pentium MMX     200 27B  66    1.7   4.5    10    21    49   111   244   560  1148  2417
Pentium Pro     200 16F  66   0.91   2.0   4.2    14    35    81   172   374   817  1779
Celeron A       400 25F  66   0.34  0.88   4.2    13    38    78   172   365   782  1645  3486
Pentium IIIE    550 26F 100   0.23  0.55   1.2   3.0    11    26    58   127   278   618  1374
Pentium IIIEB   660 26F 133   0.19  0.46   1.0   2.6    10    23    54   121   262   577  1276
Celeron 2       900 25F 100   0.15  0.36   1.6    12    33    77   170   333   683  1438  3073
PIII Tualatin  1266 27F 133   0.10  0.23  0.49   1.1   4.3    11    31    78   184   412   917
Pentium 4      1900 16F 133   0.10  0.23  0.51   1.4   5.9    16    36    85   185   406   907
Celeron M      1295 38F      0.082  0.19  0.44  0.94   2.1   6.2    21    56   130   295   668
Pentium 4N     2400 17F 133  0.075  0.18  0.39   1.0   3.7    12    33    75   169   372   819
Pentium 4N     2400 17F RD2  0.074  0.18  0.39   1.0   3.0   9.0    23    57   128   285   651
Pentium 4E     3000 28F DC3  0.062  0.18  0.49  0.97   2.8   5.9    15    34    73   167   390
Pentium 4N     3066 17F DD1  0.058  0.14  0.30  0.80   2.6   8.6    24    56   124   273   620
Pentium M2     1862 39F DC1  0.058  0.13  0.31  0.65   1.5   3.2   8.2    23    63   146   334
Atom M         1600 H7F SCC   0.23  0.51   1.1   2.4   5.6    14    32    71   156   337   739
Core 2 Duo M   1830 39F DC5  0.055  0.12  0.29  0.63   1.4   3.1   7.2    20    48   105   233
Celeron C2 M   2000 38F DC4  0.051  0.12  0.27  0.58   1.3   3.3   9.9    24    53   118   269
Core2 Duo A1CP 2400 3AF DC4  0.041 0.094  0.22  0.48   1.1   2.3   5.0    16    59   164   418
Core2 Duo B1CP 2400 3AF DC4  0.041 0.094  0.22  0.48   1.1   2.3   5.0    12    32    83   191
Core2 Duo B1CP 2400 3AF DC6  0.042  0.10  0.22  0.48   1.1   2.4   5.2    12    31    75   167
Core i5 2467M  2300 3WF DC8  0.032 0.069  0.15  0.33  0.90   1.7   3.7   8.6    20    44    97
Core i7 930    3060 3XF DC7  0.028 0.062  0.14  0.30  0.73   1.6   3.4   7.2    17    41    95
Core i7 860    3460 3XF DC8  0.028 0.062  0.14  0.30  0.71   1.5   3.3   7.0    16    39    88
Core i7 4820K  3900 3VF QC9  0.017 0.038  0.09  0.19  0.47   1.0   2.2   4.6    10    26    62

Duron           700 44F 133   0.14  0.28  0.88   5.0    15    34    76   172   379   836  1870
Athlon Tbird   1200 46F 133  0.081  0.17  0.45   1.6   7.9    22    53   123   282   645  1485
Athlon 4       1410 46F DD1  0.066  0.14  0.38   1.3   4.7    12    26    61   137   306   695
Athlon 4       1794 46F DD3  0.057  0.11  0.31   1.1   3.8   9.5    21    47   105   227   517
Athlon 4 Bart  1800 47F#DD1  0.053  0.11  0.30   1.0   4.2    15    35    81   183   409   925
Turion 64 M    1900 47F DC4  0.049  0.10  0.28   1.0   2.6   7.1    16    34    78   177   396
Athlon XP      2080 46F DD2  0.046  0.10  0.26  0.89   3.6   8.8    19    44   102   229   527
Athlon 64aa    2210 47F DC3  0.041 0.086  0.22  0.77   2.1   5.8    13    29    66   147   326
Phenom         3000 4ZF DC8  0.028 0.059  0.15  0.45   1.0   2.5   5.6    13    32    88   216



FFTGraf Version 3

Single Precision Milliseconds
                     Cache       FFT Size K --->
Processor       MHz  & RAM       1     2     4     8    16    32    64   128   256   512  1024

Pentium         200 16B  66    1.5   3.9   8.4    19    43    97   220   484  1048  2218  4611
Pentium MMX     200 27B  66    1.4   3.2   7.8    17    38    86   192   417   882  1869
Pentium Pro     200 16F  66   0.80   1.8   3.8   8.3    22    55   121   263   557  1220
Pentium II      400 27H 100   0.30  0.77   2.5   5.4    12    31    81   189   409   876  1897
Celeron A       450 25F 100   0.27  0.60   1.4   3.3    10    24    51   109   237   502  1092
Pentium IIIE    550 26F 100   0.18  0.40  0.90   1.9   6.2    18    40    87   185   394   838 S
Pentium 4      1900 16F 133  0.074  0.16  0.35  0.71   2.3   7.6    22    50   107   236   571 S
Celeron M      1295 38F      0.071  0.15  0.33  0.83   1.9   4.3    11    33    86   194   436 S
Pentium 4N     2400 17F 133  0.058  0.13  0.27  0.57   1.4   4.3    16    49   104   224   521 S
Pentium 4N     2400 17F RD2  0.057  0.12  0.27  0.56   1.4   3.6    12    32    70   156   364 S
Pentium 4N     2533 17F DD1  0.055  0.12  0.25  0.52   1.3   3.6    12    37    78   169   393 S
Pentium 4N     2533 17F RD3  0.055  0.12  0.26  0.53   1.3   3.3    10    26    57   124   289 S
Pentium 4E     3000 28F DC3  0.052  0.11  0.23  0.49   1.2   2.8   6.3    17    39    83   182 S
Pentium M2     1862 39F DC1  0.050  0.10  0.23  0.58   1.3   2.9   6.3    15    39    95   213 S
Pentium 4N     3066 17F DD1  0.044  0.10  0.21  0.44   1.1   3.1    11    33    71   154   359 S
Pentium 4N     3678 17F DC3  0.038 0.086  0.18  0.37  0.91   2.3   6.9    19    42    92   231 S
Atom M         1600 H7F SCC   0.22  0.23  0.58   1.2   2.9   6.6    17    42    92   200   437 S
Core 2 Duo M   1830 39F DC5  0.033  0.07  0.16  0.38  0.89   2.0   5.0    10    27    65   136 S
Celeron C2 M   2000 38F DC4  0.032  0.07  0.15  0.35  0.82   1.8   4.4    14    34    73   159 S
Core2 Duo A1CP 2400 3AF DC4  0.024 0.053  0.12  0.29  0.67   1.5   3.2   6.8    19    66   213 S
Core2 Duo B1CP 2400 3AF DC4  0.025 0.053  0.12  0.29  0.67   1.5   3.2   6.8    16    42   108 S
Core i5 2467M  2300 3WF DC8  0.019 0.044  0.09  0.21  0.50   1.1   2.5   5.4    12    31    68 S
Core i7 860    3460 3XF DC8  0.023 0.048  0.10  0.23  0.57   1.3   2.7   5.7    12    27    65 S
Core i7 930    3060 3XF DC7  0.017 0.035  0.08  0.18  0.45   1.0   2.1   4.6    10    23    58 S
Core i7 4820K  3900 3VF QC9  0.010 0.022  0.05  0.12  0.29   0.7   1.4   3.0     7    15    37 S

Duron           750 44F 133   0.11  0.23  0.48   1.4   5.9    15    36    81   201   475  1112
Athlon Tbird   1200 46F 133  0.072  0.15  0.30  0.77   3.3    11    35    77   176   402   932
Athlon 4       1794 46F DD3  0.050  0.10  0.21  0.56   2.0   6.7    19    41    90   203   478 S
Athlon 4 Bart  1800 47F#DD1  0.050  0.10  0.22  0.56   1.5   4.8    22    54   119   265   602 S
Turion 64 M    1900 47F DC4  0.052  0.10  0.21  0.50   1.4   3.4    10    23    52   112   252 S
Athlon XP      2080 46F DD2  0.043 0.089  0.19  0.48   1.6   5.0    13    29    65   150   364 S
Athlon 64a     2000 48F DD3  0.041 0.083  0.17  0.46   1.3   3.1   7.2    20    49   112   262 S
Opteron        2000 48F DD3  0.040 0.082  0.17  0.46   1.3   3.0   7.4    21    50   121   289 S
Athlon 64aa    2210 47F DC3  0.036 0.074  0.15  0.40   1.1   2.9   8.4    18    43    95   214 S
Phenom         3000 4ZF DC8  0.020 0.041 0.085  0.22  0.60   1.5   3.8   8.5    19    46   133 S

Double Precision Milliseconds

Pentium         200 16B  66    2.2   4.8    11    23    58   131   291   701  1509  3031  6274
Pentium MMX     200 27B  66    1.6   4.4   9.4    20    49   111   244   532  1133  2392
Pentium Pro     200 16F  66    1.0   2.2   4.9    17    37    88   193   410   881  1932
Pentium II      400 27H 100   0.41   1.6   3.5   7.7    21    54   128   285   620  1339  2914
Celeron A       450 25F 100   0.29  0.73   1.7   8.4    21    44    95   205   445   949  2037
Pentium IIIE    550 26F 100   0.23  0.53   1.1   2.8    11    25    57   127   276   604  1319
Celeron M      1295 38F       0.10  0.23  0.54   1.2   2.7   7.3    23    60   138   308   691 S
Pentium 4      1900 16F 133  0.091  0.19  0.40   1.0   4.9    14    32    69   147   322   755 S
Pentium 4N     2400 17F 133  0.071  0.15  0.31  0.70   3.0    11    31    68   147   315   712 S
Pentium M2     1862 39F DC1  0.070  0.16  0.38  0.81   1.9   4.0    10    26    66   151   331 S
Pentium 4N     2400 17F RD2  0.069  0.14  0.31  0.68   2.3   7.1    19    42    92   197   440 S
Pentium 4N     2533 17F RD3  0.067  0.14  0.30  0.66   2.0   5.8    15    34    74   158   354 S
Pentium 4N     2533 17F DC1  0.065  0.14  0.30  0.64   2.2   6.8    19    41    89   190   428 S
Pentium 4E     3000 28F DC3  0.058  0.13  0.28  0.60   1.5   3.7    11    24    51   114   255 S
Pentium 4N     3066 17F DD1  0.054  0.11  0.24  0.54   2.1   7.3    21    45    99   212   475 S
Pentium 4N     3678 17F DC3  0.046  0.10  0.21  0.44   1.4   4.2    11    25    54   115   256 S
Atom M         1600 H7F SCC   0.19  0.46  0.99   2.1   5.2    13    30    70   155   339   746 S
Core 2 Duo M   1830 39F DC5  0.042 0.094  0.23  0.50   1.1   2.5   6.3    17    42    89   190 S
Celeron C2 M   2000 38F DC4  0.040 0.087  0.21  0.46   1.1   2.8   9.1    22    49   106   232 S
Core2 Duo A1CP 2400 3AF DC4  0.031 0.070  0.17  0.38  0.87   1.9   4.1    13    50   154   362 S
Core2 Duo B1CP 2400 3AF DC4  0.031 0.071  0.18  0.38  0.87   1.9   4.1    10    28    70   158 S
Core i5 2467M  2300 3WF DC8  0.024 0.059  0.14  0.27  0.66   1.5   3.2   7.2    17    38    84 S
Core i7 860    3460 3XF DC8  0.028 0.062  0.14  0.30  0.76   1.7   3.6   7.6    17    40    88 S
Core i7 930    3060 3XF DC7  0.023 0.051  0.12  0.26  0.65   1.4   3.1   6.5    15    37    86 S
Core i7 4820K  3900 3VF QC9  0.013 0.030  0.07  0.15  0.40   0.9   1.9   4.0     9    22    55 S

Duron           750 44F 133   0.12  0.25  0.85   5.1    15    33    73   163   365   821  1872
Athlon Tbird   1200 46F 133  0.080  0.16  0.46   1.6   8.1    23    52   120   274   636  1505
Athlon 64a     2000 48F DD3  0.063  0.13  0.34   1.1   2.5   5.9    16    35    77   174   392 S
Opteron        2000 48F DD3  0.062  0.13  0.34   1.1   2.5   6.0    15    34    80   187   433 S
Athlon 64aa    2210 47F DC3  0.056  0.12  0.29  0.90   2.4   6.3    14    30    65   145   315 S
Athlon 4       1794 46F DD3  0.049  0.10  0.32   1.2   4.6    12    25    53   117   265   636
Athlon 4 Bart  1800 47F#DD1  0.049  0.10  0.31   1.2   4.2    15    36    79   172   373   872
Turion 64 M    1900 47F DC4  0.068  0.14  0.36   1.1   2.9   7.4    16    35    76   167   367 S
Athlon XP      2080 46F DD2  0.043 0.092  0.27  0.99   3.6   9.1    20    42    90   205   482
Phenom         3000 4ZF DC8  0.028 0.058  0.15  0.53   1.3   2.9   6.3    14    32    82   186 S


Linux Benchmarks

The Linux benchmarks were recompiled under Ubuntu 14.04 using GCC 4.8.2, which can handle later Intel CPU instructions, including AVX1, and the results are included below. A 64 bit version of this Ubuntu was installed on an external USB 3 disk drive to work on a PC that boots in UEFI mode. Another 64 bit version was installed on a USB 2 flash drive that can be used successfully on different PCs. Then, a 32 bit version was used to compile the 32 bit benchmarks. In order to run the latter on 64 bit systems, 32 bit lib386 Shared Object files have to be installed. Numerous proposed methods of installing these are available on the web. The methods used for the USB disk drive installation handle all benchmarks tried so far, but the 64 bit flash drive lacks support for 32 bit OpenMP.

Some results included were compiled and run via older Ubuntu releases, including 32 bit versions.

Three of the benchmarks, including source code and compile commands, are in memory_benchmarks.tar.gz, with others in AVX_benchmarks.tar.gz, linux_openmp.tar.gz and FFTbenchmarks.zip. Further details are provided below, including differences to the Windows benchmarks in some of the functions used.

    




MemSpeed memory_speed32, memory_speed64, memory_speed64AVX

The non-AVX tests were not included in the memory_benchmark collection, because of the inexplicably slow performance of the Windows MemSpeed program. At a later date, OpenMP based benchmarks were produced, including one for MemSpeed, along with normal non-OpenMP versions obtained by omitting the OMP directives. These memory_speed32 and memory_speed64 benchmarks are in linux_openmp.tar.gz and those for memory_speed64AVX are in AVX_benchmarks.tar.gz.

These benchmarks are somewhat different from the Windows version. They use C functions throughout instead of assembly code, and the first set of tests comprises x[m]=x[m]+s*y[m] instead of s=s+x[m]*y[m], the former being equivalent to the performance dependent calculations in the Linpack Benchmark.
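The two calculation forms mentioned above can be sketched as follows. This is only an illustration of the difference, not the benchmark source: the function names `axpy_update` and `dot_accumulate` are made up here, and the real tests use unrolled loops over larger arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Linux MemSpeed form: x[m] = x[m] + s*y[m].
   Reads x and y, and writes the result back to x. */
void axpy_update(double *x, const double *y, double s, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t m = 0; m < n; m++)
        x[m] = x[m] + s * y[m];
}

/* Windows MemSpeed form: s = s + x[m]*y[m].
   Reads x and y, but only accumulates into a scalar. */
double dot_accumulate(const double *x, const double *y, double s, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t m = 0; m < n; m++)
        s = s + x[m] * y[m];
    return s;
}
```

Both forms execute one multiply and one add per pair of words read, but the first also stores a result for every element, so the memory traffic pattern differs.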

The AVX version is produced from the same source code by simply including the -mavx parameter in the compile command. Note that running this benchmark on a CPU without AVX functions leads to an illegal instruction indication.
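One way to avoid the illegal instruction failure is to check for AVX before choosing which program to run. The sketch below is not part of the benchmarks; it assumes GCC (4.8 or later) or Clang on an x86 target, where the `__builtin_cpu_supports` builtin is available, and `pick_binary` is a hypothetical helper invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the running CPU reports AVX support, otherwise 0.
   Relies on a GCC/Clang builtin for x86; other targets report 0. */
int cpu_has_avx(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;
#endif
}

/* Hypothetical helper: returns avx_name on AVX capable CPUs,
   otherwise plain_name. */
const char *pick_binary(const char *avx_name, const char *plain_name)
{
    assert(avx_name != NULL && plain_name != NULL);
    return cpu_has_avx() ? avx_name : plain_name;
}
```

For example, `pick_binary("memory_speed64AVX", "memory_speed64")` would select the non-AVX program on the Phenom II and Core 2 Duo systems reported below.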

The first calculations are of the following format, but with addition instead of multiplication for the y[] calculations using integers. Due to an oversight, the sum variable was zero, as used in the earlier assembly code, and was omitted by the compiler. The integer code then became the same as the second set of integer calculations.

   for (m=0; m<kd; m=m+inc)
   {
       x[m]   = x[m]   + sum * y[m];
       x[m+1] = x[m+1] + sum * y[m+1];
       x[m+2] = x[m+2] + sum * y[m+2];
       x[m+3] = x[m+3] + sum * y[m+3];
   }

   Integer actually x[m] = x[m] + y[m]; etc.

Below is an example log of all results using a Core i7 CPU. Note Int32 recorded speeds.

    Core i7 4820K mainly running at 3.9 GHz using Turbo Boost
            1600 MHz RAM over 4 channels, Windows 10

     Memory Reading Speed Test 64 Bit Version 4.1 by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

L1     4   35055  24094  50991  35054  24063  50855  28639  19434  28857
       8   35561  24552  56045  35533  24531  56189  29986  20126  29989
      16   35600  24756  59167  35651  24779  59100  30580  20485  30631
      32   35649  24897  60694  35696  24882  60668  29004  20668  30922
L2    64   31928  24488  47659  33712  24869  47557  23979  20652  29566
     128   31968  24523  47270  33755  24904  46989  23756  20673  29348
     256   30341  24404  42477  31936  24817  42498  21201  19791  26384
L3   512   25170  23307  30347  26145  23920  30347  15217  15231  17256
    1024   25136  23215  30197  26016  23823  30161  15206  15179  17216
    2048   25110  23257  30095  26043  23853  30095  15157  15126  17309
    4096   25127  23216  30017  25973  23768  29979  15137  15129  17234
    8192   25030  23284  29765  25944  23858  29768  14975  14987  17024
   16384   15624  15696  15474  15835  15948  15474   7769   7710   7660
   32768   14474  14578  14423  14670  14782  14417   7283   7270   7223
   65536   14734  14893  14683  14976  15047  14683   7401   7387   7342
R 131072   15054  15204  14955  15262  15376  14952   7519   7513   7448
  262144   15224  15373  15032  15451  15560  15088   7599   7589   7525
  524288   15312  15433  15163  15486  15638  15164   7631   7628   7558
 1048576   15295  15459  15220  15562  15691  15202   7648   7644   7575
 2097152   15374  15526  15231  15587  15723  15225   7655   7653   7583
 4194304   15393  15544  15241  15588  15716  15232   7670   7660   7592
R=RAM
    



MemSpeed Comparisons

Here, maximum MFLOPS are provided for the first floating point tests, by dividing MB/second by 8 for double precision (DP) and by 4 for single precision (SP).
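The conversion can be sketched as below. The reasoning in the comment (two operations per pair of words read) is an inference from the x[m]=x[m]+s*y[m] calculation and from the divisors quoted above, and the function name `max_mflops` is invented for this example.

```c
#include <assert.h>

/* x[m] = x[m] + s*y[m] performs 2 floating point operations (multiply
   and add) for every 2 words read (x[m] and y[m]), so on the assumption
   that MB/second counts both words, MFLOPS = MB/second / bytes per word. */
double max_mflops(double mbytes_per_sec, int bytes_per_word)
{
    assert(bytes_per_word == 4 || bytes_per_word == 8);
    return mbytes_per_sec / (double)bytes_per_word;
}
```

Using the Core i7 64 bit L1 results below, `max_mflops(35055.0, 8)` gives 4381.875 and `max_mflops(24094.0, 4)` gives 6023.5, which round to the 4382 and 6024 Max MFLOPS shown.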

With 64 bit operation, SSE SIMD mulp and addp instructions are used for SP, with 4 words in 128 bit xmm registers, using SSE2 for DP, with 2 words in the registers. These provide up to 4 or 2 simultaneous calculations respectively, at least providing significant SP performance gains.

AVX 1 ymm registers have 256 bits, with vmulp and vaddp instructions being compiled, potentially doubling SSE and SSE2 speeds, with 4 DP and 8 SP words. In this case, the compiler appears to have further unrolled the SP calculation loop from 4 to 8 x[] and y[] addresses. For the i7 results shown, AVX SP MFLOPS increased by more than 3.2 times over the 32 bit version (14161 versus 4422) and by about 2.35 times over the 64 bit SSE version (6024).

Maximum Integer MIPS are not shown but, for each 32 bytes (8 words) read, the assembly code instructions used were 12 at 32 bits and 7 at both 64 bits and AVX, so MIPS can be obtained by dividing MB/second by 2.67 (32/12) or 4.57 (32/7) respectively.
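The divisors are simply bytes read per instruction executed. A minimal sketch of the arithmetic follows, with function names invented for this example; the instruction counts are the ones quoted above.

```c
#include <assert.h>

/* Bytes read per instruction, e.g. 32 bytes / 12 instructions = 2.67. */
double bytes_per_instruction(int bytes_read, int instructions)
{
    assert(bytes_read > 0 && instructions > 0);
    return (double)bytes_read / (double)instructions;
}

/* MIPS = MB/second / (bytes per instruction), written as a single
   multiply and divide to avoid rounding the intermediate divisor. */
double integer_mips(double mbytes_per_sec, int bytes_read, int instructions)
{
    assert(bytes_read > 0 && instructions > 0);
    return mbytes_per_sec * (double)instructions / (double)bytes_read;
}
```

As an example, the 32 bit Core i7 Int32 speed of 17790 MB/second in the table below, with 12 instructions per 32 bytes, corresponds to `integer_mips(17790.0, 32, 12)`, or 6671.25 MIPS.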

Intel Core i7 3900 MHz

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

64 bit
L1     4   35055  24094  50991  35054  24063  50855  28639  19434  28857
L2    64   31928  24488  47659  33712  24869  47557  23979  20652  29566
L3   512   25170  23307  30347  26145  23920  30347  15217  15231  17256
RAM  1GB   15295  15459  15220  15562  15691  15202   7648   7644   7575
Max MFLOPS  4382   6024
 
32 bit
L1     4   34692  17689  17790  35225  17813  17729  28530  14966  14985
L2    64   31050  17302  17833  32912  17797  17826  23586  13465  13439
L3   512   25186  17162  17689  26020  17597  17689  15493  11042  11007
RAM  1GB   14925  13795  14008  15043  13962  14000   7429   7596   7596
Max MFLOPS  4337   4422

64 Bit AVX
L1     4   59966  56645  52617  57685  56546  57879  37587  37408  37239
L2    64   48203  40892  48473  49140  49052  48634  30648  30484  30511
L3   512   32499  31113  33002  33332  33388  33001  19044  19007  19036
RAM  1GB   14737  14973  14564  14552  14545  14585   7365   7363   7358
Max MFLOPS  7496  14161


AMD Phenom II 3000 MHz

64 bit
L1     4   25660  18401  30360  23110  20146  30376  22347  15111  15262
      64   27372  19184  31675  24405  21049  31675  23940  15971  15932
     512   17329  16351  20723  17300  16766  20722  10540  10616  10489
RAM  1GB    6386   6105   6420   6382   6282   6387   3250   3296   3230
Max MFLOPS  3208   4600

32 bit
L1     4   21508  11080  11610  22942  11245  11576  12599  11693  11606
      64   22600  11317  11662  24020  11454  11616  12085  11974  11975
     512   14511   9152   9410  14641   9496   9407   7824   6723   6712
RAM  1GB    6569   5962   6254   6325   5991   6259   3414   3223   3244
Max MFLOPS  2689   2770

64 Bit AVX  Illegal instruction (no AVX instructions)


Intel Core 2 Duo 2400 MHz

64 bit
L1     4   15901  12391  10680  17787  12440  10680  18827   9222   6212
      64   12237  11368  10424  12240  11049  10433   7883   7920   6380
     512   12261  11379  10445  12262  11053  10444   7848   7905   6392
RAM 0.5GB   3421   3420   3387   3454   3426   3395   1788   1731   1761
Max MFLOPS  1988   3098 

32 bit
L1     4   17321   8606   9464  19039   9441   9463  18888   9279   9276
      64   11722   7628   7986  11344   7607   7988   8001   5369   5358
     512    7987   5078   5326   7569   5102   5320   5363   3582   3576
RAM 0.5GB   3441   3356   3389   3621   3357   3388   1787   1727   1736
Max MFLOPS  2165   2052

64 Bit AVX  Illegal instruction (no AVX instructions)
    




BusSpeed - busspeed32, busspeed64

Unlike the Windows BusSpeed Benchmark, this one mainly uses up to 64 C AND statements, instead of assembly code. The exception is the 128bSSE2 test that comprises 64 pand assembly instructions. For this benchmark, the 64 bit version uses 64 bit integers. The benchmarks and source code are in memory_benchmarks.tar.gz.
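As a rough illustration of reading memory via AND statements, the sketch below ANDs one word out of every `inc` words. It is a simplified, hypothetical version: the function name `and_read` is invented, the loop is not unrolled to 64 statements as in the benchmark, and it assumes that column names such as Inc32wds in the results below refer to an address increment of 32 words, with ReadAll corresponding to an increment of 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AND together one 64 bit word from every `inc` words of data[].
   inc = 32 would mimic Inc32wds, inc = 1 would mimic ReadAll
   (assumed meanings of the column names). */
uint64_t and_read(const uint64_t *data, size_t words, size_t inc)
{
    assert(data != NULL && inc > 0);
    uint64_t andsum = ~(uint64_t)0;      /* all bits set: identity for AND */
    for (size_t i = 0; i < words; i += inc)
        andsum &= data[i];
    return andsum;
}
```

With larger increments, fewer words are used from each cache line or memory burst that is fetched, which would be consistent with the lower MB/second figures in the Inc32wds to Inc2wds columns.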

Maximum MIPS speeds are provided for reading all data into integer and SSE type registers. As with the Windows Benchmark, RAM speeds from these single core tests are nowhere near the specification, and multiple cores need to be used to approach this.

Intel Core i7 3900 MHz

Bus Speed Test 64 bit Version 2.0 Thu Sep 28 17:16:54 2017 #64 bit Integers

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

L1      6    31239    31266    31259    42217    38193    42603    61440
       24    31291    31277    31274    41804    39352    42873    62262
L2     96    12301    12620    12598    21826    30551    39152    57284
      384     5508     5565     5700    11234    20684    34185    42054
      768     5304     5392     5504    10811    19331    33475    38119
L3   1536     5273     5371     5503    10815    19411    33663    38265
RAM 16380     1284     1566     2174     4770     9130    18560    19152
   131070     1225     1486     2099     4549     8741    18116    18376
   393210     1225     1486     2098     4548     8738    18136    18354
Max MIPS                                                   5325     7680#

32 bit
        6    15320    15478    20506    18275    20314    21357    60664
       96     7476     7624    11483    16509    20036    21095    60585
     1536     2692     2763     5393     9657    16704    21006    38151
   393210      744     1049     2252     4365     9073    16411    18322
Max MIPS                                                   5339    15166
 
AMD Phenom II 3000 MHz

64 bit
   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    21389    22715    26255    27025    27023    26405    23754
       96     2988     2970     2989     5984    11764    20688    23792
     1536     1294     1294     1294     2584     5176    10227    10350
   393210      930      933      939     1856     3840     6620     7695
Max MIPS                                                   3301     2969#

32 bit
        6    11117    12846    13522    13682    13464    13337    23747
       96     1494     1496     2984     5879    10598    12484    23877
     1536      646      646     1291     2586     5108     9128    10334
   393210      464      470      931     1917     3295     5497     7605
Max MIPS                                                   3334     5937

Intel Core 2 Duo 2400 MHz

64 bit
        6    10623    11662    12091    12350    12469    12763    24901
       96     2768     2768     2732     4471     6076     8959    12758
     1536     2783     2782     2742     4472     6079     8952    12804
   393210      640      641      745     1499     2619     4915     5122
Max MIPS                                                   1595     3113#

32 bit
        6     8568     9064     9171     9314     9405     9429    24883
       96     1383     1366     2182     3038     4474     5394    12728
     1536     1470     1359     2176     3032     4479     5396    12785
   393210      321      372      747     1317     2467     4273     5110
Max MIPS                                                   2357     6221
   




RandMem - randmem32, randmem64

These benchmarks are compiled from the same C code calculations as the Windows version. They use the same format of complex integer based indexing for serial and random reading and writing, with the final data transferred being either 32 bit integers or 64 bit double precision floating point numbers. The benchmarks and source code are in memory_benchmarks.tar.gz.
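The general idea of index based serial and random reading can be sketched as below. This is a simplified illustration under stated assumptions, not the benchmark code: the function names, the Fisher-Yates shuffle and the small linear congruential generator are all choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill idx[] with 0..n-1 in order: a serial access pattern. */
void fill_serial(uint32_t *idx, size_t n)
{
    assert(idx != NULL);
    for (size_t i = 0; i < n; i++)
        idx[i] = (uint32_t)i;
}

/* Shuffle idx[] in place (Fisher-Yates) to give a random access pattern.
   The LCG below is an assumption, not the benchmark's generator. */
void shuffle_indices(uint32_t *idx, size_t n, uint32_t seed)
{
    assert(idx != NULL);
    uint32_t state = seed;
    for (size_t i = n; i > 1; i--) {
        state = state * 1664525u + 1013904223u;   /* LCG step */
        size_t j = (size_t)(state >> 8) % i;      /* position in 0..i-1 */
        uint32_t t = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = t;
    }
}

/* Read every element of data[] once, in the order given by idx[]. */
uint64_t indexed_read_sum(const uint32_t *data, const uint32_t *idx, size_t n)
{
    assert(data != NULL && idx != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}
```

The same data is read in both cases, so the sum is identical; only the order of the addresses changes, which is what separates the Serial and Random columns in the results below.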

Measured MB/second speeds are often effectively the same as those of the Windows version, and the same between 32 bit and 64 bit compilations.

Intel Core i7 3900 MHz

   Random/Serial Memory Test 64 Bit Version 2 Thu Sep 28 17:19:24 2017

         Integer.......................  Double/Integer................
         Serial........  Random........  Serial........  Random........
    RAM   Read   Rd/Wrt   Read   Rd/Wrt   Read   Rd/Wrt   Read   Rd/Wrt
     KB  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec

L1    6   26897   28366   26501   25786   30253   43467   30413   43903
     12   26968   28832   26380   27136   29908   43492   29908   43009
     24   27070   29211   26533   28203   29833   43659   29846   42796
     48   23201   23715   18765   12802   29685   33930   29673   30590
L2   96   23236   23770   13926    8923   29764   34102   22945   14939
    192   22993   21952    9837    6748   29257   32009   18268   12080
L3  384   22393   18694    8006    5842   28141   25442   14411    9830
    768   22292   18045    6049    4976   27842   23271   10213    8064
   1536   22322   18046    5414    4581   27854   23306    8808    7312
   3072   21970   17466    3246    3144   27501   23175    8219    6897
   6144   22439   18121    5052    4282   27419   23172    3477    3282
R 12288   15218   12304    2523    2694   20306   16075    4143    4373
  24576   13920   11179    1333    1338   18171   13831    2317    2394
  49152   14004   11264    1071    1062   17806   13683    1756    1784
  98304   14083   11308     973     865   18624   13836    1589    1559
 196608   14060   11234     930     685   18625   13840    1495    1177
 393216   14073   11325     910     624   18609   13836    1452     992
 786432   14093   11366     901     603   18567   13695    1433     935
1572864   13966   11357     892     614   18651   13840    1422     923
R=RAM

32 bit
      6   24707   28684   24260   28013   29137   42429   29249   43078
     96   22416   23796   13381    8764   29711   33703   23322   14583
   3072   21188   17452    3239    3145   26582   23052    8270    6913
  98304   13835   11296     970     899   18504   13735    1590    1551

AMD Phenom II 3000 MHz

64 bit
      6   12535    9131   12630    9061   16804   13612   16819   13611
     96   11973    8457    6859    5215   16923   11770   16495   11879
   3072    6118    7092    1202    1165    9527    9192    2045    2040
  98304    4367    3628     639     586    7131    5904    1077     959

32 bit
      6   13435   11393   12822   11139   16736   20256   16787   19821
     96   11481   10024    6903    5507   16967   16169   16542   14603
   3072    7718    7388    1072    1047    9433    9043    2048    2043
  98304    4423    3653     651     594    7081    5744    1079     944

Intel Core 2 Duo 2400 MHz

64 bit
      6    9150   12202    9152    5156   13712   16195   13714   15619
     96    8010    9497    4112    3702   11341   11886    7376    6420
   3072    7799    9287    2835    2598   10811   10442    3725    3357
  98304    3337    2345     471     345    4671    2766     711     561

32 bit
      6    8586   12171    8576    6574   13635   18131   13634   18092
     96    7620    9425    4015    3735   11355   12085    7371    6441
   3072    5050    6122    1931    1784    7303    6878    2521    2232
  98304    3858    2056     436     334    4990    2763     706     560
   




SSEfpu - ssefpu32, ssefpu64

This is a variation of the SSE3DNow Benchmark, with extensions but excluding AMD 3DNow tests. The benchmark measures Single Precision (SP) and Double Precision (DP) Floating Point speeds, with data streaming from caches and RAM. It uses SSE (SP) and SSE2 (DP) assembly code instructions, along with compiled C code that produces the old x87 instructions at 32 bits and SSE type instructions on a 64 bit system. The additional tests avoid intermediate register to register operations, using s=(s+x[m])*y[m] and s=s+x[m]+y[m], to produce much faster speeds. The former leads to linked multiply and add operations that can produce up to eight floating point operations per clock cycle, or 31.2 GFLOPS on the Core i7 reported on below, with the appropriate test achieving up to a respectable 25 GFLOPS.
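The GFLOPS figures quoted above can be reproduced with the following arithmetic. The function names are invented for this example, and the conversion from MB/second assumes that s=(s+x[m])*y[m] carries out 2 single precision operations for each 8 bytes read (x[m] and y[m]), which is an inference rather than something stated in the benchmark output.

```c
#include <assert.h>

/* Theoretical peak: floating point operations per clock times clock GHz. */
double peak_gflops(double cpu_ghz, int flops_per_cycle)
{
    assert(cpu_ghz > 0.0 && flops_per_cycle > 0);
    return cpu_ghz * (double)flops_per_cycle;
}

/* Assumed conversion for the SP +*SSE test: 2 operations per 8 bytes
   read, so MFLOPS = MB/second / 4 and GFLOPS = MFLOPS / 1000. */
double sp_gflops_from_mbps(double mbytes_per_sec)
{
    assert(mbytes_per_sec >= 0.0);
    return mbytes_per_sec / 4.0 / 1000.0;
}
```

Here `peak_gflops(3.9, 8)` gives 31.2, and the fastest +*SSE speed in the table below, 100245 MB/second, gives `sp_gflops_from_mbps(100245.0)` of about 25.06, matching the 25 GFLOPS mentioned.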

Note that results from 64 bit and 32 bit compilations can be virtually the same. This could be expected for the SSE tests, as they use the same SSE assembly code instructions. Even the integer test results can be similar, with the 32 bit version compiled to use the old x87 floating point instructions and the 64 bit version using SSE instructions, but limited to scalar operation, dealing with only one of the four SSE register compartments. SSE performance is also similar to that from the Windows benchmark, but results are completely different for the compiled integer tests (the old compilation folder is not available to investigate). The benchmarks and source code are in memory_benchmarks.tar.gz.
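
The peak and measured rates quoted above can be related with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and the MB/S to MFLOPS conversion is an assumption inferred by comparing the MB/S columns with the Maximum MFLOPS line in the Core i7 table, not something taken from the benchmark source code.

```python
def peak_gflops(clock_mhz, flops_per_clock):
    """Theoretical peak: clock rate times operations issued per cycle."""
    return clock_mhz * flops_per_clock / 1000.0


def mflops(mbytes_per_sec, bytes_per_word, words_per_step, flops_per_step):
    """Convert a data rate to MFLOPS.

    Assumption (inferred from the table, not from the benchmark source):
    every loop step reads words_per_step words and performs
    flops_per_step floating point operations.
    """
    bytes_per_step = bytes_per_word * words_per_step
    return mbytes_per_sec * flops_per_step / bytes_per_step


peak = peak_gflops(3900, 8)            # 8 operations per clock at 3900 MHz

# Best Core i7 64 bit MB/S figures from the results table.
dp_mul_add = mflops(41562, 8, 2, 2)    # s=s+x[m]*y[m], SSE2 DP
sp_add = mflops(81550, 4, 2, 1)        # x[m]=x[m]+y[m], SSE SP
sp_linked = mflops(100245, 4, 2, 2)    # s=(s+x[m])*y[m], SSE SP

print(f"peak {peak:.1f} GFLOPS")
print(f"{dp_mul_add:.0f} {sp_add:.0f} {sp_linked:.0f} MFLOPS")
print(f"linked multiply and add: {sp_linked / 1000 / peak:.0%} of peak")
```

Under these assumptions the three conversions round to 5195, 10194 and 25061 MFLOPS, matching the Maximum MFLOPS line of the Core i7 64 bit results, and the best linked multiply and add test reaches roughly 80% of the 31.2 GFLOPS peak.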

    Intel Core i7 3900 MHz

     SSE & SSE2 Memory Reading Speed Test 64-Bit Version 2.1     

  Memory    --s=s+x[m]*y[m]---   --x[m]=x[m]+y[m]-- (s+x[m])?y[m]
  KBytes    SSE2    SSE   Sngl   SSE2    SSE   Sngl  +*SSE  ++SSE
   Used     MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

L1     4   41006  41014  10752  78678  75044  28425  93492  61309
       8   41329  41332  10585  78495  78680  27592  99656  61421
      16   41485  41485  10501  80823  80935  27681 100245  60960
      32   41562  41550  10459  81545  81550  27726  93422  60961
L2    64   41482  41442  10437  50270  50047  27208  56854  57013
     128   41516  41524  10428  49254  49178  27219  56004  56140
     256   40293  40326  10423  46558  46549  26748  48312  48513
L3   512   37261  37298  10418  32513  32531  24421  39719  39780
    1024   36790  36813  10414  31430  31425  24132  38698  38793
    2048   36880  36904  10418  31394  31400  24202  38839  38906
    4096   36931  36929  10415  31399  31381  24271  38891  38958
    8192   36791  36873  10416  31254  31306  24281  38765  38790
   16384   21227  21228   9540  15121  15124  15659  20817  20834
   32768   21407  21377   9560  14777  14762  15431  20967  20951
   65536   21831  21843   9576  14980  14981  15592  21380  21383
R 131072   22093  22104   9585  14980  14985  15600  21611  21649
  262144   22310  22297   9586  14986  15037  15675  21782  21792
  524288   22431  22530   9581  15054  15039  15682  21932  21931
 1048576   22591  22604   9590  15040  15055  15692  22035  22026
 2097152   22629  22634   9587  15059  15062  15700  22120  22108
 4194304   21864  21868   9573  14881  14873  15461  21372  21407
R=RAM
            SSE2    SSE   Norm   SSE2    SSE   Norm    SSE    SSE
 Maximum      DP     SP     SP     DP     SP     SP     SP     SP
  MFLOPS    5195  10388   2688   5097  10194   3553  25061  15355

32 bit
L1     4   40984  41012  10755  79355  79200  21372  90456  61235
      64   41499  41546  10440  49718  49820  18195  57058  57309
     512   35927  35840  10415  30957  30986  16915  37994  38140
RAM  1GB   20978  20953  10081  14733  14734  12392  20633  20624

AMD Phenom II 3000 MHz

64 bit
L1     4   22720  22649   6141  43355  43377  23298  66228  41175
      64   23878  23878   6017  44716  45514  23782  85916  46752
     512   20095  20048   6000  18630  18629  16662  20036  20018
RAM  1GB    8163   8260   5395   6754   6794   6757   8046   7939

32 bit
L1     4   22723  22686   6128  43666  41471  11794  66231  41868
      64   23841  23864   6018  42659  39727  11638  86456  46784
     512   17425  17335   5991  16456  16441   9528  17529  17536
RAM  1GB    8511   8519   5484   6921   6915   6199   8295   8256

Intel Core 2 Duo 2400 MHz

64 bit
L1     4   25197  25195   6601  36943  36943  13349  34725  34993
      64   18093  18606   6400  17062  17062  12685  19620  19639
     512   18343  18736   6396  17125  17128  12703  19793  19809
RAM 0.5GB   5712   5756   3951   3628   3501   3391   5676   5731

32 bit
L1     4   25193  25195   6603  37082  37081   9869  35725  35222
      64   11904  11846   4261  11227  11228   5039  12454  12540
     512   11927  11887   4261  11261  11261   5071  12586  12446
RAM 0.5GB   5727   5741   3956   3471   3499   3310   5668   5704



FFT Benchmarks - FFT1, FFT3c

The benchmarks and source code are in fftgraf.zip, which also contains benchmarks using the same format for Windows, Raspberry Pi and Android. An example of logged results is below, and these can be compared with the Windows Results above for the same Core i7 system, which produced slightly different performance but identical numerical checks. As a reminder, these benchmarks are all C code, with FFT1 being the original program and FFT3c the third optimised one, with rearranged C statements instead of assembly code. Detailed results follow, along with comparisons that show the performance gains of single vs double precision, 64 vs 32 bit and FFT3c vs FFT1. These demonstrate the variability of gains between different processors and FFT sizes.

 FFT 64 Bit Benchmark Version 3c.0 Thu Sep 3 10:32:25 2015

   Size                   milliseconds
      K       Single Precision            Double Precision
      1     0.019   0.012   0.012     0.016   0.016   0.016
      2     0.029   0.026   0.026     0.035   0.035   0.035
      4     0.063   0.058   0.058     0.079   0.079   0.079
      8     0.147   0.136   0.136     0.177   0.176   0.176
     16     0.333   0.315   0.314     0.365   0.364   0.364
     32     0.710   0.683   0.687     0.783   0.785   0.784
     64     1.521   1.467   1.469     1.696   1.699   1.693
    128     3.285   3.186   3.181     3.639   3.633   3.637
    256     7.303   6.950   6.947     8.140   8.088   8.145
    512    15.859  15.442  15.437    21.008  21.054  21.187
   1024    38.551  37.789  37.776    65.300  65.009  65.388

       1024 Square Check  Maximum Noise  Average Noise
       SP   9.999520e-01   3.346482e-06   4.565234e-11
       DP   1.000000e+00   1.133294e-23   1.428110e-28

                     Cache    FFT Size K ---> Results in milliseconds
 Processor       MHz & RAM       1     2     4     8    16    32    64   128   256   512  1024

 FFT1 SP 64 bit
 Core 2 Duo     2400 3AF DC4 0.037  0.09  0.22  0.64  1.48   3.4   7.7  17.0    37    96   587
 Phenom         3000 4ZF DC8 0.026  0.06  0.14  0.34  1.55   4.0   9.8  27.6    65   151   549
 Core i7 4820K  3900 3VF QC9 0.014  0.03  0.07  0.22  0.56   1.4   3.9   9.1    21    49   111

 FFT1 DP 64 bit
 Core 2 Duo     2400 3AF DC4 0.044  0.11  0.30  0.69  1.57   3.5   7.6  16.5    47   317   763
 Phenom         3000 4ZF DC8 0.031  0.07  0.17  0.80  2.04   5.0  14.0  32.7    76   283   712
 Core i7 4820K  3900 3VF QC9 0.016  0.04  0.11  0.27  0.67   1.9   4.5  10.6    24    55   234

 FFT3c SP 64 bit
 Core 2 Duo     2400 3AF DC4 0.029  0.07  0.17  0.41  0.92   2.0   4.3   9.4    21    52   141
 Phenom         3000 4ZF DC8 0.021  0.05  0.10  0.26  0.66   1.6   4.0   9.3    21    53   153
 Core i7 4820K  3900 3VF QC9 0.012  0.03  0.06  0.14  0.31   0.7   1.5   3.2   6.9    15    38

 FFT3c DP 64 bit
 Core 2 Duo     2400 3AF DC4 0.054  0.12  0.29  0.63  1.28   2.8   6.1  14.1    34    85   195
 Phenom         3000 4ZF DC8 0.026  0.05  0.14  0.40  0.87   2.1   4.6  10.7    27    78    96
 Core i7 4820K  3900 3VF QC9 0.016  0.04  0.08  0.18  0.36   0.8   1.7   3.6   8.1    21    65

 FFT1 SP 32 bit
 Core 2 Duo     2400 3AF DC4 0.038  0.09  0.23  0.65  1.56   3.6   8.2  18.1    39   108   441
 Phenom         3000 4ZF DC8 0.029  0.07  0.19  0.35  1.59   4.0   9.8  27.7    65   150   535
 Core i7 4820K  3900 3VF QC9 0.018  0.04  0.09  0.26  0.64   1.6   4.5  10.3    23    53   118

 FFT1 DP 32 bit
 Core 2 Duo     2400 3AF DC4 0.043  0.11  0.30  0.73  1.68   3.8   8.6  19.0    58   247   624
 Phenom         3000 4ZF DC8 0.029  0.09  0.24  0.81  2.04   4.9  13.9  32.3    75   282   711
 Core i7 4820K  3900 3VF QC9 0.018  0.04  0.12  0.30  0.74   2.2   5.0  11.4    26    60   296

 FFT3c SP 32 bit
 Core 2 Duo     2400 3AF DC4 0.033  0.08  0.18  0.43  0.95   2.1   4.6  10.0    23    54   127
 Phenom         3000 4ZF DC8 0.028  0.06  0.13  0.31  0.77   1.8   4.4  10.0    23    55   157
 Core i7 4820K  3900 3VF QC9 0.015  0.03  0.07  0.17  0.38   0.8   1.8   3.9     8    19    46

 FFT3c DP 32 bit
 Core 2 Duo     2400 3AF DC4 0.034  0.08  0.19  0.42  1.04   2.3   4.8  11.3    28    69   155
 Phenom         3000 4ZF DC8 0.026  0.05  0.14  0.40  0.95   2.2   4.9  11.4    29    81   207
 Core i7 4820K  3900 3VF QC9 0.015  0.03  0.08  0.17  0.43   0.9   2.0   4.2    10    25    78

 Performance Gains           FFT Size K --->
                                 1     2     4     8    16    32    64   128   256   512  1024

 64 bit SP/DP
 Core 2 Duo                   1.86  1.77  1.71  1.54  1.39  1.43  1.43  1.50  1.60  1.64  1.39
 Phenom                       1.24  1.20  1.42  1.55  1.32  1.31  1.15  1.15  1.26  1.48  0.62
 Core i7 4820K                1.33  1.35  1.36  1.29  1.16  1.15  1.15  1.14  1.16  1.36  1.72

 64 bit/32 bit SP
 Core 2 Duo                   1.14  1.09  1.08  1.05  1.03  1.06  1.06  1.06  1.06  1.04  0.90
 Phenom                       1.33  1.33  1.32  1.19  1.17  1.14  1.09  1.08  1.06  1.04  1.02
 Core i7 4820K                1.25  1.27  1.24  1.21  1.22  1.22  1.23  1.22  1.20  1.22  1.20

 64 bit SP FFT3/FFT1
 Core 2 Duo                   1.28  1.30  1.33  1.56  1.61  1.72  1.77  1.80  1.72  1.84  4.17
 Phenom                       1.24  1.36  1.46  1.31  2.34  2.51  2.45  2.97  3.02  2.84  3.58
 Core i7 4820K                1.17  1.23  1.28  1.63  1.79  2.04  2.68  2.87  3.09  3.20  2.95

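
The performance gains appear to be ratios of the running times in the results tables, with the time of the slower configuration divided by that of the faster one. This reading is an assumption, checked against a few table entries rather than stated in the source. The Python sketch below recomputes the 64 bit SP FFT3/FFT1 gains for the Core 2 Duo; the function and variable names are hypothetical.

```python
# Core 2 Duo, 64 bit single precision timings in milliseconds for
# FFT sizes 1K to 1024K, copied from the results tables.
fft1_sp_64 = [0.037, 0.09, 0.22, 0.64, 1.48, 3.4, 7.7, 17.0, 37, 96, 587]
fft3c_sp_64 = [0.029, 0.07, 0.17, 0.41, 0.92, 2.0, 4.3, 9.4, 21, 52, 141]


def gains(baseline_ms, compared_ms):
    """Ratio of baseline time to compared time for each FFT size.

    A value above 1.0 means the compared configuration is faster.
    """
    return [b / c for b, c in zip(baseline_ms, compared_ms)]


fft3_vs_fft1 = gains(fft1_sp_64, fft3c_sp_64)
print(" ".join(f"{g:.2f}" for g in fft3_vs_fft1))
```

The recomputed values can differ from the listed gains in the second decimal place, presumably because the listed gains were calculated from unrounded timings.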